dsfsi/umsuka-english

Name: dsfsi/umsuka-english
Creator: dsfsi
Published: 2026-03-17 14:05:42
License: CC-BY-4.0

Hugging Face | Updated 2026-03-17 | Indexed 2026-04-05

Download link:

https://hf-mirror.com/datasets/dsfsi/umsuka-english

Description:

---
language:
- en
- zu
license: cc-by-4.0
task_categories:
- translation
pretty_name: Umsuka English-isiZulu Parallel Corpus
tags:
- AfricaNLP
- low-resource
- isiZulu
configs:
- config_name: en-zu
  data_files:
  - split: train
    path: data/en-zu.training.csv
  - split: validation
    path: data/en-zu.eval.csv
- config_name: zu-en
  data_files:
  - split: train
    path: data/zu-en.training.csv
  - split: validation
    path: data/zu-en.eval.csv
---

# Umsuka English-isiZulu Parallel Corpus

## Dataset Description

- **Homepage:** https://zenodo.org/records/5035171
- **License:** [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
- **DOI:** 10.5281/zenodo.5035171
- **Version:** 0.1
- **Created:** March 2021
- **Curators:** Rooweither Mabuya; Jade Abbott; Vukosi Marivate

### Summary

An open-source, high quality isiZulu parallel corpus sourced from a mixture of domains, taking into account both Southern African context and international English context, translated by professional translators. It contains 5,000 English sentences translated into isiZulu and 5,000 isiZulu sentences translated into English, with 1,000 pairs per direction held out as evaluation data. Since isiZulu is highly morphologically complex, the English-to-isiZulu evaluation set was translated at least twice by different translators to allow calculation of human-level BLEU scores.

## Configs (Subsets)

| Config | Description |
|--------|-------------|
| `en-zu` | English source sentences with isiZulu translations |
| `zu-en` | isiZulu source sentences with English translations |

## Dataset Structure

### Data Fields

- `translation`: dict with two keys:
  - `en`: the English sentence
  - `zu`: the isiZulu sentence
- `source`: the data source (e.g. `News Crawl 2019`, `Newspaper`)

### Data Splits

| Config | Train | Validation |
|--------|------:|----------:|
| en-zu | ~8,705 | 998 |
| zu-en | ~8,705 | 986 |

> The training set totals 9,703 sentence pairs across both directions.

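The field layout described under Data Fields can be illustrated with a minimal sketch. It assumes each row is a dict with a nested `translation` dict and a `source` string, as the card states. The sample rows and the helper names `iter_pairs` and `count_by_source` are hypothetical and used only for illustration; they are not taken from the corpus or from any library.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical rows shaped like the fields described above (not corpus data).
SAMPLE_RECORDS = [
    {"translation": {"en": "Hello", "zu": "Sawubona"}, "source": "News Crawl 2019"},
    {"translation": {"en": "Thank you", "zu": "Ngiyabonga"}, "source": "Newspaper"},
]


def iter_pairs(
    records: Iterable[dict], src: str = "en", tgt: str = "zu"
) -> Iterator[Tuple[str, str]]:
    """Yield (source sentence, target sentence) tuples for one direction."""
    for record in records:
        translation = record["translation"]
        yield translation[src], translation[tgt]


def count_by_source(records: Iterable[dict]) -> Dict[str, int]:
    """Count how many rows come from each value of the `source` field."""
    return dict(Counter(record["source"] for record in records))


if __name__ == "__main__":
    for src_text, tgt_text in iter_pairs(SAMPLE_RECORDS, src="zu", tgt="en"):
        print(f"{src_text} -> {tgt_text}")
    print(count_by_source(SAMPLE_RECORDS))
```

Swapping the `src` and `tgt` arguments reads the same rows in the opposite direction, which mirrors how the `en-zu` and `zu-en` configs differ only in which language is treated as the source.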
## Curation Rationale

The corpus was developed to provide an open-source, high-quality isiZulu parallel corpus from a mixture of domains. A pilot translation study on 500 sentences was conducted to give translators feedback on translation quality before the full dataset was created.

### Source Data

**English sentences** were sampled from the [News Crawl](https://data.statmt.org/news-crawl/) datasets, which are sourced from existing news sources on the internet. Note that News Crawl data reflects perspectives that skew young, white, and male; a translation model trained on such data will likely encode this hegemonic viewpoint.

**isiZulu sentences** were sourced from two corpora:

- ~2% from isiZulu short stories and novels
- ~98% from isiZulu newspaper articles (2012–2016) sampled from [Isolezwe](https://www.isolezwe.co.za/), [Ilanga](https://www.ilanganews.co.za/), and [Ezasegagasini](https://www.ezasegagasini.co.za/) (EThekwini Municipality) publications

### Translator Demographics

| Attribute | Detail |
|-----------|--------|
| Gender | 25 female, 5 male |
| Age range | 39–58 |
| Country of origin | South Africa |
| Language proficiency | Professional translators, registered with the South African Translators Institute |

### Data Curator Demographics

**Rooweither Mabuya**, Professional isiZulu Linguist

- isiZulu and English speaker
- Compiled the isiZulu sentences for translation
- Performed quality checks on the translations
- Gave feedback to the translators during the pilot round

**Jade Abbott**, NLP Practitioner

- English speaker
- Sampled the English sentences from News Crawl
- Performed all data cleaning and filtering
- Converted the data into formats the translators could use with their spell checkers

## Preprocessing

The following filtering was applied to the source and final datasets:

- English statements were removed from the isiZulu set using the `langdetect` Python package
- Duplicates were removed
- Sentences containing non-ASCII characters were removed
- Sentences shorter than 10 characters were removed
- Each line was ensured to contain only a single sentence
- Physical addresses were removed
- Email addresses were removed

Some problems were identified only during or after translation, so some sentences were lost. This is why the final counts are lower than the original target of 5,000 per direction.

## Usage

```python
from datasets import load_dataset

# English → isiZulu
en_zu = load_dataset("dsfsi/umsuka-english", "en-zu")

# isiZulu → English
zu_en = load_dataset("dsfsi/umsuka-english", "zu-en")
```

## Papers Using This Dataset

- Ngomane et al. (2023), [Unsupervised Cross-lingual Word Embedding Representation for English-isiZulu](https://aclanthology.org/2023.rail-1.2/): used the corpus to train cross-lingual word embeddings with VecMap for zero-shot news classification between English and isiZulu (RAIL 2023, ACL).

## Citation

```bibtex
@dataset{mabuya_umsuka_2021,
  author    = {Mabuya, Rooweither and Abbott, Jade and Marivate, Vukosi},
  title     = {Umsuka English - isiZulu Parallel Corpus},
  year      = {2021},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.5035171},
  url       = {https://zenodo.org/records/5035171}
}
```

```bibtex
@inproceedings{ngomane-etal-2023-unsupervised,
  title     = "Unsupervised Cross-lingual Word Embedding Representation for {E}nglish-isi{Z}ulu",
  author    = "Ngomane, Derwin and Mabuya, Rooweither and Abbott, Jade and Marivate, Vukosi",
  booktitle = "Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)",
  month     = may,
  year      = "2023",
  address   = "Dubrovnik, Croatia",
  publisher = "Association for Computational Linguistics",
  pages     = "11--17",
  doi       = "10.18653/v1/2023.rail-1.2"
}
```

## Acknowledgements

Funded by a Facebook Research Grant. Thanks to Paco Guzman and Marc'Aurelio Ranzato for feedback on the proposal.
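Several of the filters listed under Preprocessing can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not the curators' actual pipeline: the function name `clean_sentences`, the email regular expression, and the choice to drop a whole sentence when it contains an email address are all assumptions. Language detection with `langdetect`, sentence splitting, and physical-address removal are left out.

```python
import re
from typing import Iterable, List

# Assumed pattern for spotting email addresses; the original pipeline's
# pattern is not documented in the card.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Threshold taken from the card: sentences under 10 characters were removed.
MIN_CHARS = 10


def clean_sentences(sentences: Iterable[str]) -> List[str]:
    """Apply the length, ASCII, email, and duplicate filters in order."""
    seen = set()
    kept = []
    for raw in sentences:
        sentence = raw.strip()
        if len(sentence) < MIN_CHARS:
            continue  # too short
        if not sentence.isascii():
            continue  # contains non-ASCII characters
        if EMAIL_RE.search(sentence):
            continue  # assumption: drop sentences that contain an email address
        if sentence in seen:
            continue  # exact duplicate of an earlier sentence
        seen.add(sentence)
        kept.append(sentence)
    return kept


if __name__ == "__main__":
    raw = [
        "Short",
        "This is a clean sentence.",
        "This is a clean sentence.",
        "Contact me at someone@example.com today.",
    ]
    print(clean_sentences(raw))
```

Order is preserved so that the first occurrence of a duplicated sentence is the one kept, which keeps the output deterministic for a given input.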
