
ZomiLearner/English-Zomi-OPUS_Tatoeba_v20230412

Hugging Face · Updated 2026-02-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ZomiLearner/English-Zomi-OPUS_Tatoeba_v20230412
Resource description:
---
language:
- en
- ctd
license: cc0-1.0
tags:
- parallel-corpus
- machine-translation
- english-zomi
- zomi
- zomi_Latn
- llm-training
- nlp
- opus
- open-license
size_categories:
- 1M<n<10M
task_categories:
- translation
pretty_name: English–Zomi Parallel Corpus (1.78M)
---

# English–Zomi Parallel Corpus (1.78M)

This dataset contains **1.78 million English–Zomi sentence pairs**, created to support machine translation, linguistic research, and large-scale language model training. It is fully open and permissively licensed for **commercial and non-commercial use**.

---

## 🌐 Linguistic Background: Zomi, Tedim Chin, and ISO Codes

**Zomi** is the endonym (self-chosen name) of the people and their language. However, Zomi does **not yet have an official ISO 639-3 code**. Until the ISO code for Zomi becomes available, this dataset uses **ctd** *(Tedim Chin)* for technical compatibility while acknowledging that **Zomi** is the correct endonym.

- **ISO 639-3: ctd** (currently used for compatibility)
- **ISO 639-3: zol** (*not yet official*, pending)
- **Language Name (endonym): Zomi**

### Why ctd?

The term *Tedim Chin* is an **exonym**, not the community's own name. It was assigned by external authorities and missionaries for administrative convenience, particularly during colonial times. Because of historical "divide and rule" policies, the Zomi people have been grouped under various imposed names that do not reflect their own identity.

---

## 📦 Dataset Summary

- **1.78M English–Zomi parallel sentence pairs**
- English source: OPUS Tatoeba v20230412 (https://object.pouta.csc.fi/OPUS-Tatoeba/v2023-04-12/mono/en.txt.gz)
- Original English sentences: **1,830,223**
- After deduplication: **1,778,043 unique English sentences**
- Zomi translations aligned to each English sentence
- Stored in **multiple Parquet shards** for efficient loading
- Released under **CC0-1.0** (public domain)

This dataset is suitable for:

- Machine translation (MT) training
- LLM pretraining and fine-tuning
- Cross-lingual research
- Low-resource language modeling
- Linguistic analysis of Zomi and related languages

---

## 🔍 Source and Attribution

### English Source

English sentences are derived from **OPUS Tatoeba v20230412**, a publicly available multilingual corpus. The dataset was deduplicated to remove repeated English sentences, resulting in **1,778,043 unique English sentences**.

### Zomi Translations

Zomi translations were created and aligned to the deduplicated English sentences. Each row in the Parquet dataset contains a single aligned English–Zomi sentence pair.

---

## 🪪 Licensing

This dataset is released under **CC0-1.0**, placing it in the public domain. This means:

- ✔ Free for commercial use
- ✔ Free for research
- ✔ Free for redistribution
- ✔ Free for training LLMs (Meta, Google, OpenAI, Amazon, etc.)
- ✔ No attribution required

*See the LICENSE file for full details.*

---

## 📈 Intended Use

- Training English ↔ Zomi MT systems
- Pretraining or fine-tuning multilingual LLMs
- Benchmarking low-resource translation
- Linguistic and academic research

---

## ⚠️ Limitations

- Automatically aligned; may contain alignment noise
- Zomi orthography may vary depending on source conventions
- Not filtered for sensitive or offensive content

---

## 📜 Citation

If you use this dataset in research, you may cite it as:

```bibtex
@dataset{english_zomi_parallel_corpus_2024,
  title     = {English–Zomi Parallel Corpus (1.78M)},
  author    = {ZomiLearner},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ZomiLearner/English-Zomi-OPUS_Tatoeba_v20230412},
  note      = {Released under CC0-1.0}
}
```

Attribution is optional due to the CC0 license.

## 🐍 Python Example

```python
from datasets import load_dataset

# Load the full training split
ds = load_dataset("ZomiLearner/English-Zomi-OPUS_Tatoeba_v20230412", split="train")

# Inspect the dataset summary and the first example
print(ds)
print(ds[0])

# Access columns
english_sentence = ds[0]["en"]
zomi_sentence = ds[0]["zom"]

print("EN:", english_sentence)
print("ZOM:", zomi_sentence)

# Iterate through the first five rows
for row in ds.select(range(5)):
    print(row["en"], " → ", row["zom"])
```

## 📚 Glossary

**Endonym**: A name used by a group of people to refer to themselves or their language.

**Exonym**: A name given to a group or language by outsiders, often for administrative or colonial purposes.

**Orthography**: The standardized system for writing a language, including spelling and conventions.

**ISO 639-3 Code**: A three-letter code used to uniquely identify languages in computational and linguistic systems.

**Parquet**: A columnar data format optimized for large-scale datasets and efficient loading.

## 📝 Academic Footnote

In sociolinguistics, an **endonym** reflects a community's self-identity, while an **exonym** is an externally imposed label. Exonyms often arise from colonial administration, missionary activity, or political boundaries, and may not align with how communities identify themselves linguistically or culturally.

## 📚 BibTeX Citation for OPUS Tatoeba v20230412

```bibtex
@inproceedings{tiedemann2012opus,
  title     = {OPUS: An Open Source Parallel Corpus},
  author    = {Tiedemann, J{\"o}rg},
  booktitle = {Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)},
  year      = {2012},
  pages     = {2214--2218},
  publisher = {European Language Resources Association (ELRA)},
  url       = {https://opus.nlpl.eu/},
  note      = {Data source: OPUS Tatoeba v20230412}
}
```
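
The Dataset Summary reports 1,830,223 original English sentences and 1,778,043 after deduplication. The card does not say how duplicates were detected, so the following is only a minimal sketch of one plausible approach: exact-match, order-preserving deduplication. The `dedupe` helper and its whitespace stripping are illustrative assumptions, not the author's pipeline. The sketch also derives how many sentences the reported counts imply were removed.

```python
def dedupe(sentences):
    """Drop exact duplicates, keeping the first occurrence of each sentence in order."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = sentence.strip()  # assumption: surrounding whitespace is not significant
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Counts quoted in the Dataset Summary of this card.
ORIGINAL_COUNT = 1_830_223
UNIQUE_COUNT = 1_778_043

removed = ORIGINAL_COUNT - UNIQUE_COUNT
removed_share = removed / ORIGINAL_COUNT

# Toy input to show the behaviour of the hypothetical helper.
sample = ["Hello.", "How are you?", "Hello.", " How are you? ", "Thanks!"]
print(dedupe(sample))
print(f"{removed:,} duplicates removed ({removed_share:.2%} of the original)")
```

On the reported counts this works out to 52,180 removed sentences, roughly 2.85% of the original English file.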
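
The Limitations section notes that the pairs are automatically aligned and may contain alignment noise. One common, rough way to screen for badly aligned pairs is a character-length ratio check. The sketch below is an assumption-laden illustration rather than anything the card prescribes: the `looks_aligned` helper, the `max_ratio=3.0` threshold, and the sample rows (placeholder strings, not real Zomi) are all hypothetical. The column names `en` and `zom` follow the Python example in this card.

```python
def looks_aligned(en, zom, max_ratio=3.0, min_chars=1):
    """Heuristic: both sides are non-empty and their lengths differ by at most max_ratio."""
    en, zom = en.strip(), zom.strip()
    if len(en) < min_chars or len(zom) < min_chars:
        return False
    longer = max(len(en), len(zom))
    shorter = min(len(en), len(zom))
    return longer / shorter <= max_ratio


# Placeholder rows for illustration only; the "zom" values are NOT real Zomi text.
pairs = [
    {"en": "Good morning.", "zom": "aaaa bbbb cc."},  # similar length: kept
    {"en": "Hi.", "zom": "a" * 50},                   # extreme ratio: dropped
    {"en": "Hello", "zom": "   "},                    # empty side: dropped
]

kept = [row for row in pairs if looks_aligned(row["en"], row["zom"])]
print(f"kept {len(kept)} of {len(pairs)} pairs")
```

With the `datasets` library, the same predicate could be passed to `ds.filter(lambda row: looks_aligned(row["en"], row["zom"]))`. A suitable threshold depends on the language pair and should be checked against a manually inspected sample before any rows are discarded.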
Provider:
ZomiLearner