zomi-language-corpora/English-Zomi-OPUS_Tatoeba_v20230412
Source: Hugging Face | Updated: 2026-04-28 | Indexed: 2026-05-03
Download link:
https://hf-mirror.com/datasets/zomi-language-corpora/English-Zomi-OPUS_Tatoeba_v20230412
Description:
This dataset contains 1.78 million English–Zomi sentence pairs, created to support machine translation, linguistic research, and large-scale language model training. It is released under CC0-1.0, so it can be used freely for both commercial and non-commercial purposes. Zomi is the language's endonym; because it does not yet have its own ISO 639-3 code, the dataset uses ctd (Tedim Chin) for compatibility. The data is derived from OPUS Tatoeba v20230412: the English sentences were deduplicated, leaving 1,778,043 unique sentences aligned with Zomi translations. Suitable uses include training MT systems, pretraining and fine-tuning multilingual LLMs, cross-lingual research, low-resource language modeling, and linguistic analysis of Zomi and related languages.
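The description says the English side was deduplicated but does not say how. As a rough illustration only, the sketch below keeps the first pair seen for each distinct English sentence. The function name `dedupe_by_english`, the whitespace normalization, and the "keep first" policy are all assumptions rather than the maintainers' actual pipeline, and the Zomi strings are placeholders, not corpus text.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (english, zomi)


def dedupe_by_english(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the first pair seen for each distinct English sentence.

    Hypothetical helper: the real corpus may have used a different rule.
    """
    seen = set()
    unique: List[Pair] = []
    for english, zomi in pairs:
        key = english.strip()  # assumption: ignore surrounding whitespace
        if key in seen:
            continue
        seen.add(key)
        unique.append((key, zomi.strip()))
    return unique


# Placeholder rows: the Zomi side is a stand-in, not real corpus text.
sample = [
    ("Hello.", "<zomi-1>"),
    ("Hello.", "<zomi-1-duplicate>"),
    ("Thank you.", "<zomi-2>"),
]
unique_pairs = dedupe_by_english(sample)
```

A stricter variant might also lowercase or Unicode-normalize the key before comparing; whether that is appropriate depends on how the published pairs were actually built.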
Provider:
zomi-language-corpora



