umsuka-english
Hugging Face · updated 2026-03-16 · indexed 2026-03-20
Download link:
https://huggingface.co/datasets/dsfsi/umsuka-english
Resource overview:
The Umsuka English-isiZulu Parallel Corpus is an open-source, high-quality parallel corpus of English and isiZulu sentence pairs drawn from multiple domains. The translations were produced by professional translators and cover both South African and international English contexts. The dataset contains 5,000 English sentences translated into isiZulu and 5,000 isiZulu sentences translated into English; in each direction, 1,000 pairs are reserved as evaluation data. Because isiZulu is morphologically complex, the English-to-isiZulu evaluation set was translated at least twice by different translators so that human-level BLEU scores can be computed. The dataset has two configurations, `en-zu` (English to isiZulu) and `zu-en` (isiZulu to English), each with a training split and a validation split. The data fields include `translation` (the English and isiZulu sentences) and `source` (the data source). Preprocessing includes removing duplicates, non-ASCII characters, and short sentences, among other steps. The dataset is suited to low-resource machine translation, particularly NLP research on African languages.
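The preprocessing steps mentioned above (removing duplicates, non-ASCII characters, and short sentences) might look roughly like the following. This is only a sketch, not the dataset authors' actual pipeline: the `en`/`zu` keys inside `translation`, the `clean_pairs` helper, the `min_words` threshold, and the sample sentences are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT the dataset authors' preprocessing code.
# Assumed record shape: {"translation": {"en": ..., "zu": ...}, "source": ...}
# The "en"/"zu" keys and the min_words threshold are hypothetical.

def strip_non_ascii(text):
    """Drop every non-ASCII character and trim surrounding whitespace."""
    return text.encode("ascii", "ignore").decode("ascii").strip()


def clean_pairs(records, min_words=3):
    """Strip non-ASCII characters, then drop short pairs and exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        en = strip_non_ascii(rec["translation"]["en"])
        zu = strip_non_ascii(rec["translation"]["zu"])
        # Skip pairs where either side has fewer than min_words tokens.
        if len(en.split()) < min_words or len(zu.split()) < min_words:
            continue
        # Skip a pair that has already been kept.
        if (en, zu) in seen:
            continue
        seen.add((en, zu))
        cleaned.append({"translation": {"en": en, "zu": zu}, "source": rec["source"]})
    return cleaned


# Made-up sample records (not taken from the corpus).
sample = [
    {"translation": {"en": "The children go to school.", "zu": "Izingane ziya esikoleni."}, "source": "example"},
    {"translation": {"en": "The children go to school.", "zu": "Izingane ziya esikoleni."}, "source": "example"},
    {"translation": {"en": "Hello", "zu": "Sawubona"}, "source": "example"},
    {"translation": {"en": "The shop is open today \u2713", "zu": "Isitolo sivuliwe namuhla \u2713"}, "source": "example"},
]

for rec in clean_pairs(sample):
    print(rec["translation"]["en"], "|", rec["translation"]["zu"])
```

Records loaded from the `en-zu` or `zu-en` configuration could be passed through the same function, provided their field layout matches the assumed shape.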
Provider:
Data Science for Social Impact
Created:
2026-03-16



