dsfsi/umsuka-english
Hugging Face · Updated 2026-03-17 · Indexed 2026-04-05
Download link: https://hf-mirror.com/datasets/dsfsi/umsuka-english
---
language:
- en
- zu
license: cc-by-4.0
task_categories:
- translation
pretty_name: Umsuka English-isiZulu Parallel Corpus
tags:
- AfricaNLP
- low-resource
- isiZulu
configs:
- config_name: en-zu
  data_files:
  - split: train
    path: data/en-zu.training.csv
  - split: validation
    path: data/en-zu.eval.csv
- config_name: zu-en
  data_files:
  - split: train
    path: data/zu-en.training.csv
  - split: validation
    path: data/zu-en.eval.csv
---
# Umsuka English-isiZulu Parallel Corpus
## Dataset Description
- **Homepage:** https://zenodo.org/records/5035171
- **License:** [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
- **DOI:** 10.5281/zenodo.5035171
- **Version:** 0.1
- **Created:** March 2021
- **Curators:** Rooweither Mabuya; Jade Abbott; Vukosi Marivate
### Summary
An open-source, high-quality isiZulu parallel corpus sourced from a mixture of domains, taking into account both Southern African and international English contexts, and translated by professional translators. The target was 5,000 English sentences translated into isiZulu and 5,000 isiZulu sentences translated into English, with roughly 1,000 pairs per direction held out as evaluation data; the final counts are somewhat lower because some sentences were dropped during cleaning and translation.
Since isiZulu is highly morphologically complex, the English-to-isiZulu evaluation set was translated at least twice by different translators to allow calculation of human-level BLEU scores.
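One way to use the repeated translations is a leave-one-out comparison: treat each translator's output as a hypothesis and the remaining translations of the same sentence as references. The sketch below only builds those (hypothesis, references) pairs. It assumes the evaluation rows have been read into (English, isiZulu) tuples with the English sentence repeated once per translator, which is an assumption about the layout rather than something this card states. The function name and sample strings are hypothetical, and the scoring itself is left to a BLEU implementation of your choice.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def leave_one_out(rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Build (hypothesis, references) pairs from repeated translations.

    rows: (english_source, isizulu_translation) tuples; a source that was
    translated by several translators appears once per translation.
    """
    by_source: Dict[str, List[str]] = defaultdict(list)
    for source, translation in rows:
        by_source[source].append(translation)

    pairs = []
    for translations in by_source.values():
        if len(translations) < 2:
            continue  # no second translation to compare against
        for i, hypothesis in enumerate(translations):
            # Every other translation of the same source acts as a reference.
            references = translations[:i] + translations[i + 1:]
            pairs.append((hypothesis, references))
    return pairs


# Placeholder strings stand in for real sentences.
rows = [("src A", "zu A1"), ("src A", "zu A2"), ("src B", "zu B1")]
pairs = leave_one_out(rows)
```

Sentences with only one translation are skipped, since there is nothing to score them against.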
## Configs (Subsets)
| Config | Description |
|--------|-------------|
| `en-zu` | English source sentences with isiZulu translations |
| `zu-en` | isiZulu source sentences with English translations |
## Dataset Structure
### Data Fields
- `translation`: a dict with two keys
  - `en`: the English sentence
  - `zu`: the isiZulu sentence
- `source`: the data source (e.g. `News Crawl 2019`, `Newspaper`)
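As a rough illustration of the field layout above, the sketch below unpacks one record into a (source, target) pair. The sample record is made up for illustration and `to_pair` is a hypothetical helper, not part of the `datasets` library.

```python
from typing import Dict, Tuple

# Hypothetical record shaped like the fields described above;
# the sentences are placeholders, not taken from the corpus.
sample_record = {
    "translation": {"en": "Hello.", "zu": "Sawubona."},
    "source": "News Crawl 2019",
}


def to_pair(record: Dict, src: str = "en", tgt: str = "zu") -> Tuple[str, str]:
    """Return (source sentence, target sentence) for one record."""
    translation = record["translation"]
    return translation[src], translation[tgt]


en_to_zu = to_pair(sample_record)              # English first
zu_to_en = to_pair(sample_record, "zu", "en")  # isiZulu first
```

Passing the language codes as arguments lets the same helper serve both the `en-zu` and `zu-en` configs.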
### Data Splits
| Config | Train | Validation |
|--------|------:|----------:|
| en-zu | ~8,705 | 998 |
| zu-en | ~8,705 | 986 |
> The training set totals 9,703 sentence pairs across both directions.
## Curation Rationale
The corpus was developed to provide an open-source, high-quality isiZulu parallel corpus from a mixture of domains. Before the full dataset was created, a pilot translation study on 500 sentences was run to give translators feedback on translation quality.
### Source Data
**English sentences** were sampled from the [News Crawl](https://data.statmt.org/news-crawl/) datasets, which are sourced from existing news sources on the internet. Note that News Crawl data reflects perspectives that skew young, white, and male — a translation model trained on such data will likely encode this hegemonic viewpoint.
**isiZulu sentences** were sourced from two corpora:
- ~2% from isiZulu short stories and novels
- ~98% from isiZulu newspaper articles (2012–2016) sampled from [Isolezwe](https://www.isolezwe.co.za/), [Ilanga](https://www.ilanganews.co.za/), and [Ezasegagasini](https://www.ezasegagasini.co.za/) (eThekwini Municipality) publications
### Translator Demographics
| Attribute | Detail |
|-----------|--------|
| Gender | 25 female, 5 male |
| Age range | 39–58 |
| Country of origin | South Africa |
| Language proficiency | Professional Translators, registered with the South African Translators Institute |
### Data Curator Demographics
**Rooweither Mabuya** — Professional isiZulu Linguist
- isiZulu and English speaker
- Compiled the isiZulu sentences for translation
- Performed quality checks on the translations
- Gave feedback to the translators during the pilot round
**Jade Abbott** — NLP Practitioner
- English speaker
- Sampled the English sentences from News Crawl
- Performed all data cleaning and filtering
- Converted the data into formats the translators could open in tools with spell checkers
## Preprocessing
The following filtering was applied to the source and final datasets:
- English statements were removed from the isiZulu set using the `langdetect` Python package
- Duplicates were removed
- Sentences containing non-ASCII characters were removed
- Sentences shorter than 10 characters were removed
- Each line was ensured to contain only a single sentence
- Physical addresses were removed
- Email addresses were removed
Some problems were identified only during or after translation, so some sentences were dropped. This is why the final counts are lower than the original target of 5,000 per direction.
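A minimal sketch of how a few of these filters might be chained is shown below. It is not the curators' script: the email pattern, the function name, and the choice to drop a whole sentence when it contains an email address are assumptions made for illustration. Language detection with `langdetect`, the single-sentence check, and physical-address removal are left out.

```python
import re
from typing import Iterable, List

# Simplified email pattern (an assumption; the original pattern is not documented).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MIN_CHARS = 10  # length threshold taken from the list above


def basic_filter(sentences: Iterable[str]) -> List[str]:
    """Apply a subset of the listed filters, keeping first occurrences."""
    seen = set()
    kept = []
    for sentence in sentences:
        text = sentence.strip()
        if len(text) < MIN_CHARS:
            continue  # too short
        if not text.isascii():
            continue  # contains non-ASCII characters
        if EMAIL_RE.search(text):
            continue  # contains an email address
        if text in seen:
            continue  # duplicate of an earlier sentence
        seen.add(text)
        kept.append(text)
    return kept


sample = [
    "Short.",
    "The meeting starts at nine tomorrow.",
    "The meeting starts at nine tomorrow.",
    "Write to someone@example.com for details.",
    "The café opens early on Sundays.",
]
cleaned = basic_filter(sample)
```

Only the first copy of the meeting sentence survives; the other samples are rejected for length, duplication, an email address, and a non-ASCII character respectively.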
## Usage
```python
from datasets import load_dataset
# English → isiZulu
en_zu = load_dataset("dsfsi/umsuka-english", "en-zu")
# isiZulu → English
zu_en = load_dataset("dsfsi/umsuka-english", "zu-en")
```
## Papers Using This Dataset
- Ngomane et al. (2023) — [Unsupervised Cross-lingual Word Embedding Representation for English-isiZulu](https://aclanthology.org/2023.rail-1.2/) — used the corpus to train cross-lingual word embeddings with VecMap for zero-shot news classification between English and isiZulu (RAIL 2023, ACL).
## Citation
```bibtex
@dataset{mabuya_umsuka_2021,
author = {Mabuya, Rooweither and Abbott, Jade and Marivate, Vukosi},
title = {Umsuka English - isiZulu Parallel Corpus},
year = {2021},
publisher = {Zenodo},
doi = {10.5281/zenodo.5035171},
url = {https://zenodo.org/records/5035171}
}
```
```bibtex
@inproceedings{ngomane-etal-2023-unsupervised,
title = "Unsupervised Cross-lingual Word Embedding Representation for {E}nglish-isi{Z}ulu",
author = "Ngomane, Derwin and Mabuya, Rooweither and Abbott, Jade and Marivate, Vukosi",
booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
pages = "11--17",
doi = "10.18653/v1/2023.rail-1.2"
}
```
## Acknowledgements
Funded by a Facebook Research Grant. Thanks to Paco Guzman and Marc'Aurelio Ranzato for feedback on the proposal.
Provider: dsfsi



