zomi-language-corpora/raw-text-corpus

Name: zomi-language-corpora/raw-text-corpus
Creator: zomi-language-corpora
Published: 2026-04-29 05:57:46
License: No description provided

Hugging Face | Updated 2026-04-29 | Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/zomi-language-corpora/raw-text-corpus

Description:

The Zomi Raw Text Corpus is an open, community-driven collection of unprocessed Zomi-language text. It serves as the foundational dataset for building the full Zomi NLP ecosystem, including tokenizers, language models, ASR/TTS systems, and downstream tasks. The dataset is intentionally raw: no normalization, deduplication, or cleaning is applied. Cleaned and task-specific datasets will be released separately. The dataset is organized by domain, such as web, news, Bible, books, social media, and transcripts. Each file includes contributor metadata for transparency and legal safety. The main purpose of the dataset is to preserve Zomi linguistic diversity, support NLP research and model development, and enable community participation in language technology.
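
Because the corpus ships without normalization or deduplication, downstream users will likely need a light cleaning pass of their own. The following is a minimal sketch of one possible pass, not the maintainers' pipeline: it assumes the text has already been read into an iterable of strings, and the function names `normalize_line` and `dedupe_lines` are hypothetical. It applies Unicode NFC normalization, collapses whitespace, and drops exact duplicates.

```python
import hashlib
import unicodedata


def normalize_line(line: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", line)
    return " ".join(text.split())


def dedupe_lines(lines):
    """Yield each non-empty normalized line once, in first-seen order.

    Only exact duplicates (after normalization) are removed; near-duplicate
    detection is out of scope for this sketch.
    """
    seen = set()
    for line in lines:
        cleaned = normalize_line(line)
        if not cleaned:
            continue
        # Store a fixed-size digest instead of the full line to bound memory.
        digest = hashlib.sha1(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield cleaned


if __name__ == "__main__":
    # Placeholder strings stand in for real corpus lines.
    sample = ["first  line", "first line", "", "second line"]
    for line in dedupe_lines(sample):
        print(line)
```

Hashing each cleaned line keeps the `seen` set small on large files, at the cost of handling only exact matches.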

Provider:

zomi-language-corpora
