ZomiLearner/English-Zomi-OPUS_Tatoeba_v20230412
Source: Hugging Face. Updated 2026-02-26; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/ZomiLearner/English-Zomi-OPUS_Tatoeba_v20230412
Resource description:
---
language:
- en
- ctd
license: cc0-1.0
tags:
- parallel-corpus
- machine-translation
- english-zomi
- zomi
- zomi_Latn
- llm-training
- nlp
- opus
- open-license
size_categories:
- 1M<n<10M
task_categories:
- translation
pretty_name: English–Zomi Parallel Corpus (1.78M)
---
# English–Zomi Parallel Corpus (1.78M)
This dataset contains **1.78 million English–Zomi sentence pairs**, created to support
machine translation, linguistic research, and large‑scale language model training.
It is fully open and permissively licensed for **commercial and non‑commercial use**.
---
## 🌐 Linguistic Background: Zomi, Tedim Chin, and ISO Codes
**Zomi** is the endonym (self‑chosen name) of the people and their language.
However, Zomi does **not yet have an official ISO 639‑3 code**. Until the ISO code for Zomi becomes available, this dataset will use **ctd** *(Tedim Chin)* for technical compatibility while acknowledging that **Zomi** is the correct endonym.
- **ISO 639‑3: ctd** — currently used for compatibility
- **ISO 639‑3: zol** — *not yet official* (pending)
- **Language Name (endonym): Zomi**
### Why ctd?
The term *Tedim Chin* is an **exonym**, not the community’s own name. It was assigned by external authorities and missionaries for administrative convenience, particularly during colonial times. Because of historical “divide and rule” policies, the Zomi people have been grouped under various imposed names that do not reflect their own identity.
---
## 📦 Dataset Summary
- **1.78M English–Zomi parallel sentence pairs**
- English source: OPUS Tatoeba v20230412
(https://object.pouta.csc.fi/OPUS-Tatoeba/v2023-04-12/mono/en.txt.gz)
- Original English sentences: **1,830,223**
- After deduplication: **1,778,043 unique English sentences**
- Zomi translations aligned to each English sentence
- Stored in **multiple Parquet shards** for efficient loading
- Released under **CC0‑1.0** (public domain)
This dataset is suitable for:
- Machine translation (MT) training
- LLM pretraining and fine‑tuning
- Cross‑lingual research
- Low‑resource language modeling
- Linguistic analysis of Zomi and related languages
---
## 🔍 Source and Attribution
### English Source
English sentences are derived from **OPUS Tatoeba v20230412**, a publicly available multilingual corpus.
The dataset was deduplicated to remove repeated English sentences, resulting in **1,778,043 unique English sentences**.
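
The card does not say how the deduplication was carried out, so the following is only a minimal sketch of one common approach: exact-match deduplication that keeps the first occurrence of each sentence in its original order. The function name `deduplicate`, the whitespace stripping, and the sample sentences are assumptions made for illustration. The final lines check the arithmetic implied by the two counts reported above.

```python
def deduplicate(sentences):
    """Drop exact repeats, keeping the first occurrence of each sentence in order.

    Assumption: sentences are compared after stripping surrounding whitespace,
    and empty lines are discarded. The card does not document the real procedure.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        key = sentence.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Illustrative input only; these are not rows taken from the dataset.
sample = ["Hello.", "How are you?", "Hello.", "  How are you?  ", "Goodbye."]
unique = deduplicate(sample)
print(unique)

# Counts reported in this card: sentences before and after deduplication.
original, deduplicated = 1_830_223, 1_778_043
removed = original - deduplicated
print(f"{removed:,} duplicates removed ({removed / original:.2%} of the original)")
```

On the reported counts, this works out to 52,180 removed sentences, a little under 3% of the original file.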
### Zomi Translations
Zomi translations were created and aligned to the deduplicated English sentences.
Each row in the Parquet dataset contains a single aligned English–Zomi sentence pair.
---
## 🪪 Licensing
This dataset is released under **CC0‑1.0**, placing it in the public domain.
This means:
- ✔ Free for commercial use
- ✔ Free for research
- ✔ Free for redistribution
- ✔ Free for training LLMs (Meta, Google, OpenAI, Amazon, etc.)
- ✔ No attribution required
*See the LICENSE file for full details.*
---
## 📈 Intended Use
- Training English ↔ Zomi MT systems
- Pretraining or fine‑tuning multilingual LLMs
- Benchmarking low‑resource translation
- Linguistic and academic research
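
The Python example in this card loads a single `train` split, so benchmarking needs a held-out set carved out by the user. One possible way to do that reproducibly is to hash each English sentence and route it by the hash value. The function `assign_split` and the 1% dev and 1% test proportions are hypothetical choices for illustration and are not part of the dataset.

```python
import hashlib


def assign_split(english, dev_percent=1, test_percent=1):
    """Map an English sentence to 'train', 'dev', or 'test' deterministically.

    The sentence is hashed with SHA-256 and reduced to a bucket in 0..99, so
    the same sentence always lands in the same split across runs and machines.
    The percentages are illustrative assumptions.
    """
    digest = hashlib.sha256(english.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_percent:
        return "test"
    if bucket < test_percent + dev_percent:
        return "dev"
    return "train"


# Illustrative sentences only; with the real data, pass each row's "en" field.
for sentence in ["Hello.", "How are you?", "Goodbye."]:
    print(sentence, "->", assign_split(sentence))
```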
---
## ⚠️ Limitations
- Automatically aligned; may contain alignment noise
- Zomi orthography may vary depending on source conventions
- Not filtered for sensitive or offensive content
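
Because the pairs may contain alignment noise, users may want a cheap sanity filter before training. The sketch below uses a token-length ratio, a common heuristic for parallel data. The function `looks_aligned`, the `max_ratio=3.0` threshold, and the placeholder target strings are assumptions for illustration. The threshold has not been tuned on this corpus, and the placeholders are not real Zomi text.

```python
def looks_aligned(en, zomi, max_ratio=3.0):
    """Heuristic check: reject pairs whose token counts differ too much.

    Returns False when either side is empty or when the longer side has more
    than `max_ratio` times as many whitespace-separated tokens as the shorter.
    """
    en_tokens = en.split()
    zomi_tokens = zomi.split()
    if not en_tokens or not zomi_tokens:
        return False
    longer = max(len(en_tokens), len(zomi_tokens))
    shorter = min(len(en_tokens), len(zomi_tokens))
    return longer / shorter <= max_ratio


# Placeholder pairs: "w1 w2 ..." stands in for target-side tokens.
pairs = [
    ("This is a short sentence.", "w1 w2 w3 w4"),
    ("Yes.", "w1 w2 w3 w4 w5 w6 w7 w8"),
    ("An empty target.", ""),
]
kept = [pair for pair in pairs if looks_aligned(*pair)]
print(f"kept {len(kept)} of {len(pairs)} pairs")
```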
---
## 📜 Citation
If you use this dataset in research, you may cite it as:
```bibtex
@dataset{english_zomi_parallel_corpus_2024,
title = {English–Zomi Parallel Corpus (1.78M)},
author = {ZomiLearner},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ZomiLearner/English-Zomi-OPUS_Tatoeba_v20230412},
note = {Released under CC0-1.0}
}
```
Attribution is optional due to the CC0 license.
## 🐍 Python Example
```python
from datasets import load_dataset
# Load the full training split
ds = load_dataset("ZomiLearner/English-Zomi-OPUS_Tatoeba_v20230412", split="train")
# Inspect a few examples
print(ds)
print(ds[0])
# Access columns
english_sentence = ds[0]["en"]
zomi_sentence = ds[0]["zom"]
print("EN:", english_sentence)
print("ZOM:", zomi_sentence)
# Iterate through the dataset
for row in ds.select(range(5)):
    print(row["en"], " → ", row["zom"])
```
## 📚 Glossary
**Endonym** — A name used by a group of people to refer to themselves or their language.
**Exonym** — A name given to a group or language by outsiders, often for administrative or colonial purposes.
**Orthography** — The standardized system for writing a language, including spelling and conventions.
**ISO 639‑3 Code** — A three‑letter code used to uniquely identify languages in computational and linguistic systems.
**Parquet** — A columnar data format optimized for large‑scale datasets and efficient loading.
## 📝 Academic Footnote
In sociolinguistics, an **endonym** reflects a community’s self‑identity, while an **exonym** is an externally imposed label. Exonyms often arise from colonial administration, missionary activity, or political boundaries, and may not align with how communities identify themselves linguistically or culturally.
## 📚 BibTeX Citation for OPUS Tatoeba v20230412
```bibtex
@inproceedings{tiedemann2012opus,
title = {Parallel Data, Tools and Interfaces in {OPUS}},
author = {Tiedemann, J{\"o}rg},
booktitle = {Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)},
year = {2012},
pages = {2214--2218},
publisher = {European Language Resources Association (ELRA)},
url = {https://opus.nlpl.eu/},
note = {Data source: OPUS Tatoeba v20230412}
}
```
Provider: ZomiLearner



