
nazarioz/changana-pt-parallel

Source: Hugging Face · Updated: 2026-03-22 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/nazarioz/changana-pt-parallel
Resource description:
---
language:
- pt
- ts
license: cc-by-4.0
task_categories:
- translation
tags:
- changana
- ronga
- xitsonga
- bantu
- low-resource
- bible
- parallel-corpus
- mozambique
pretty_name: Portuguese-Changana Parallel Corpus
size_categories:
- 1K<n<10K
---

# Portuguese–Changana Parallel Corpus

## Description

The first publicly available sentence-level parallel corpus for Portuguese and Changana (Xichangana/Ronga), a Bantu language spoken by approximately 3–5 million people in southern Mozambique, South Africa, and Zimbabwe.

## Source

The corpus was constructed by aligning two Bible translations at the verse level:

- **Portuguese:** Almeida Corrigida Fiel (ACF), published by Sociedade Bíblica Trinitária do Brasil.
- **Changana:** Bibele hi Xizronga xa Namunhla (BRN), published in 2021 by Dumbeka Editores e Consultores ([brn.xizronga.org](https://brn.xizronga.org)).

The corpus covers the 27 books of the New Testament.

## Statistics

| Statistic | Portuguese | Changana |
|-----------|-----------|----------|
| Aligned pairs | 7,929 | 7,929 |
| Mean sentence length (tokens) | 20.0 | 18.4 |
| Vocabulary size | 18,715 | 25,727 |
| CG/PT length ratio | 0.92 | 0.92 |

## Splits

| Split | Pairs |
|-------|-------|
| Train | 6,343 |
| Dev | 793 |
| Test | 793 |

Splits created with random shuffle (seed=42).

## Languages

- **Portuguese (pt):** Brazilian Portuguese, formal biblical register.
- **Changana (ts):** Also known as Xichangana, Xangana, or Ronga. Bantu language (Niger-Congo family). ISO 639-1: ts (Tsonga cluster).

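The split sizes in the Splits table are consistent with an 80/10/10 partition of the 7,929 aligned pairs after a seeded shuffle. The sketch below shows one way such a split could be produced. It is an assumption for illustration only: the card states the seed (42) but not the library or exact procedure, so the helper name `split_indices` is hypothetical and the resulting index order will not necessarily match the published splits.

```python
import random


def split_indices(n_pairs, seed=42):
    """Shuffle pair indices and cut them into 80/10/10 train/dev/test parts.

    Hypothetical helper: it reproduces the split sizes on the card,
    not necessarily the exact membership of each split.
    """
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)  # seeded, so the result is repeatable

    n_train = n_pairs * 8 // 10           # 80% for training (integer arithmetic)
    n_dev = (n_pairs - n_train) // 2      # half of the remainder for dev

    train = indices[:n_train]
    dev = indices[n_train:n_train + n_dev]
    test = indices[n_train + n_dev:]      # whatever is left goes to test
    return train, dev, test


train, dev, test = split_indices(7929)
print(len(train), len(dev), len(test))
```

With 7,929 pairs this yields 6,343 / 793 / 793 indices, matching the table. For experiments that should be comparable with other work on this corpus, use the published `train`, `dev`, and `test` splits rather than re-splitting.
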
## Limitations

- Single domain (religious text)
- Formal/archaic register, not conversational
- Small size (7,929 pairs) by modern MT standards

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("nazarioz/changana-pt-parallel")

# Access a training example
example = dataset["train"][0]
print(f"PT: {example['text_pt']}")
print(f"CG: {example['text_cg']}")
```

## Citation

If you use this dataset, please cite:

```bibtex
@misc{changana-pt-parallel-2026,
  title={Portuguese--Changana Parallel Corpus},
  author={Nazario},
  year={2026},
  url={https://huggingface.co/datasets/nazarioz/changana-pt-parallel}
}
```

## License

CC-BY-4.0
Provider:
nazarioz