nazarioz/changana-pt-parallel
Source: Hugging Face · Updated: 2026-03-22 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/nazarioz/changana-pt-parallel
---
language:
- pt
- ts
license: cc-by-4.0
task_categories:
- translation
tags:
- changana
- ronga
- xitsonga
- bantu
- low-resource
- bible
- parallel-corpus
- mozambique
pretty_name: Portuguese-Changana Parallel Corpus
size_categories:
- 1K<n<10K
---
# Portuguese–Changana Parallel Corpus
## Description
This dataset is a sentence-level parallel corpus for Portuguese
and Changana (Xichangana/Ronga), presented by its author as the first
to be publicly available. Changana is a Bantu language spoken by
approximately 3–5 million people in southern Mozambique,
South Africa, and Zimbabwe.
## Source
The corpus was constructed by aligning two Bible translations
at the verse level:
- **Portuguese:** Almeida Corrigida Fiel (ACF), published by
Sociedade Bíblica Trinitária do Brasil.
- **Changana:** Bibele hi Xizronga xa Namunhla (BRN), published
in 2021 by Dumbeka Editores e Consultores
([brn.xizronga.org](https://brn.xizronga.org)).

The corpus covers the 27 books of the New Testament.
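
The card does not publish the alignment script, so the following is only a minimal sketch of what verse-level alignment could look like. It assumes each translation has already been parsed into a dictionary keyed by a hypothetical `(book, chapter, verse)` tuple; the function name, the key format, and the toy strings are illustrative assumptions, while the output field names `text_pt` and `text_cg` follow the usage example in this card.

```python
from typing import Dict, List, Tuple

# Hypothetical key format: (book code, chapter number, verse number).
VerseKey = Tuple[str, int, int]


def align_verses(
    pt: Dict[VerseKey, str], cg: Dict[VerseKey, str]
) -> List[Dict[str, str]]:
    """Pair verses that are present and non-empty in both translations."""
    pairs = []
    # Only keys found in both translations can form an aligned pair.
    for key in sorted(pt.keys() & cg.keys()):
        text_pt, text_cg = pt[key].strip(), cg[key].strip()
        if text_pt and text_cg:
            pairs.append({"text_pt": text_pt, "text_cg": text_cg})
    return pairs


# Toy placeholder strings, not actual verses from either translation.
pt_verses = {("JHN", 1, 1): "verso pt 1", ("JHN", 1, 2): "verso pt 2"}
cg_verses = {("JHN", 1, 1): "verso cg 1", ("JHN", 1, 3): "verso cg 3"}

aligned = align_verses(pt_verses, cg_verses)
print(aligned)
```

Verses that appear in only one translation are dropped in this sketch, which is one plausible reason an aligned corpus can end up slightly smaller than the full verse count of either source.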
## Statistics
| Statistic | Portuguese | Changana |
|-----------|-----------|----------|
| Aligned pairs | 7,929 | 7,929 |
| Mean sentence length (tokens) | 20.0 | 18.4 |
| Vocabulary size | 18,715 | 25,727 |
| CG/PT length ratio | 0.92 | 0.92 |
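
The card does not say how tokens were counted. As a rough sketch, statistics of this kind can be computed with whitespace tokenization and a lowercased vocabulary; both choices are assumptions, so the numbers produced this way may differ from the table. The length ratio is the Changana mean length divided by the Portuguese mean length, which for the published means gives 18.4 / 20.0 = 0.92.

```python
from typing import Dict, List


def corpus_stats(sentences: List[str]) -> Dict[str, float]:
    """Mean length and vocabulary size under whitespace tokenization (assumed)."""
    tokenized = [s.split() for s in sentences]
    n_tokens = sum(len(tokens) for tokens in tokenized)
    vocab = {tok.lower() for tokens in tokenized for tok in tokens}
    return {
        "pairs": len(sentences),
        "mean_len": n_tokens / len(sentences) if sentences else 0.0,
        "vocab_size": len(vocab),
    }


# Toy placeholder data standing in for the two sides of the corpus.
pt_side = ["a b c d", "a b"]
cg_side = ["x y z", "x"]

stats_pt = corpus_stats(pt_side)
stats_cg = corpus_stats(cg_side)
length_ratio = stats_cg["mean_len"] / stats_pt["mean_len"]

print(stats_pt, stats_cg, round(length_ratio, 2))
```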
## Splits
| Split | Pairs |
|-------|-------|
| Train | 6,343 |
| Dev | 793 |
| Test | 793 |

The splits were created with a random shuffle (seed=42), giving an approximately 80/10/10 train/dev/test partition.
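
The exact splitting procedure is not published. The sketch below shows one way a seeded shuffle with an 80/10/10 partition yields the split sizes in the table; the use of Python's `random` module, the function name, and the rounding rule are assumptions, so it will not necessarily reproduce the same assignment of pairs to splits.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_corpus(
    pairs: Sequence[T], seed: int = 42, train_frac: float = 0.8
) -> Dict[str, List[T]]:
    """Shuffle deterministically, then cut into train/dev/test."""
    items = list(pairs)
    random.Random(seed).shuffle(items)  # local RNG, global state untouched
    n_train = int(len(items) * train_frac)
    n_dev = (len(items) - n_train) // 2  # remainder is split between dev and test
    return {
        "train": items[:n_train],
        "dev": items[n_train:n_train + n_dev],
        "test": items[n_train + n_dev:],
    }


# Integer indices stand in for the 7,929 aligned pairs.
splits = split_corpus(range(7929))
print({name: len(part) for name, part in splits.items()})
# {'train': 6343, 'dev': 793, 'test': 793}
```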
## Languages
- **Portuguese (pt):** Brazilian Portuguese, formal biblical register.
- **Changana (ts):** Also known as Xichangana, Xangana, or Ronga.
A Bantu language (Niger-Congo family), tagged here with the ISO 639-1 code `ts` (Tsonga cluster).
## Limitations
- Single domain (religious text)
- Formal/archaic register, not conversational
- Small size (7,929 pairs) by modern MT standards
## Usage
```python
from datasets import load_dataset
dataset = load_dataset("nazarioz/changana-pt-parallel")
# Access a training example
example = dataset["train"][0]
print(f"PT: {example['text_pt']}")
print(f"CG: {example['text_cg']}")
```
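
Some translation fine-tuning scripts expect each example as a nested `translation` dictionary keyed by language code rather than as flat columns. The helper below is a hypothetical sketch of that conversion; it assumes the `text_pt` and `text_cg` fields shown above and reuses the `pt` and `ts` codes from the card metadata. The sample strings are placeholders, not corpus content.

```python
from typing import Dict


def to_translation_record(example: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Map flat text_pt/text_cg fields to a nested translation record."""
    return {"translation": {"pt": example["text_pt"], "ts": example["text_cg"]}}


# Placeholder example mimicking one row of the dataset.
sample = {"text_pt": "<Portuguese verse>", "text_cg": "<Changana verse>"}
record = to_translation_record(sample)
print(record)

# With the dataset loaded as in the usage snippet above, the same function
# could be applied to every split:
# dataset = dataset.map(to_translation_record,
#                       remove_columns=["text_pt", "text_cg"])
```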
## Citation
If you use this dataset, please cite:
```bibtex
@misc{changana-pt-parallel-2026,
title={Portuguese--Changana Parallel Corpus},
author={Nazario},
year={2026},
url={https://huggingface.co/datasets/nazarioz/changana-pt-parallel}
}
```
## License
CC-BY-4.0
Provider:
nazarioz



