UdS-LSV/menyo20k_mt
Dataset Overview
Dataset Summary
Name: MENYO-20k
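Description: MENYO-20k is a multi-domain parallel corpus with texts drawn from news articles, TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and by professional translators. The dataset contains 20,100 parallel sentences, split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences (3,419 multi-domain, 1,714 news domain, and 1,500 TED talk transcript domain).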
Languages: English (en) and Yoruba (yo)
License: CC BY-NC 4.0
Multilinguality: translation
Task categories: translation
Size category: 10K<n<100K
Source datasets: original
Dataset Structure
Data Instances
{
  "translation": {
    "en": "Unit 1: What is Creative Commons?",
    "yo": "Ìdá 1: Kín ni Creative Commons?"
  }
}
Data Fields
- translation:
  - en: the English sentence
  - yo: the Yoruba sentence
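
As a minimal sketch of how these fields might be read, the snippet below parses a record shaped like the instance under Data Instances and pulls out the sentence pair. It assumes records arrive as JSON strings; the helper name `extract_pair` is hypothetical and not part of any library.

```python
import json
from typing import Tuple

# Example record copied from the data instance shown in this card.
RAW = (
    '{"translation": {"en": "Unit 1: What is Creative Commons?", '
    '"yo": "Ìdá 1: Kín ni Creative Commons?"}}'
)


def extract_pair(record: dict) -> Tuple[str, str]:
    """Return (English, Yoruba) from one record (hypothetical helper)."""
    translation = record["translation"]
    return translation["en"], translation["yo"]


record = json.loads(RAW)
en_sentence, yo_sentence = extract_pair(record)
print(en_sentence)
print(yo_sentence)
```

If the Hugging Face `datasets` library is used instead, each row would presumably expose the same nested `translation` dictionary, so the helper should apply unchanged.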
Data Splits
- Train: 10,070 examples, 2,551,345 bytes
- Validation: 3,397 examples, 870,011 bytes
- Test: 6,633 examples, 1,905,432 bytes
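
The split sizes can be checked against the totals given in the description. The sketch below uses only numbers stated in this card; the dictionary names are illustrative.

```python
# Split sizes as reported in this card.
SPLITS = {"train": 10_070, "validation": 3_397, "test": 6_633}
# Test-set breakdown by domain, taken from the card's description.
TEST_DOMAINS = {"multi-domain": 3_419, "news": 1_714, "ted": 1_500}

total = sum(SPLITS.values())

# The three splits should add up to the 20,100 sentences stated in the card,
# and the three test subdomains should add up to the test split.
assert total == 20_100
assert sum(TEST_DOMAINS.values()) == SPLITS["test"]

for name, count in SPLITS.items():
    print(f"{name}: {count} sentences ({count / total:.1%} of the corpus)")
```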
Dataset Creation
Annotations creators
- expert-generated
- found
Language creators
- found
Licensing Information
The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
Citation Information
@inproceedings{adelani-etal-2021-effect,
    title = "The Effect of Domain and Diacritics in {Y}oruba{--}{E}nglish Neural Machine Translation",
    author = "Adelani, David and
      Ruiter, Dana and
      Alabi, Jesujoba and
      Adebonojo, Damilola and
      Ayeni, Adesina and
      Adeyemi, Mofe and
      Awokoya, Ayodele Esther and
      Espa{\~n}a-Bonet, Cristina",
    booktitle = "Proceedings of the 18th Biennial Machine Translation Summit (Volume 1: Research Track)",
    month = aug,
    year = "2021",
    address = "Virtual",
    publisher = "Association for Machine Translation in the Americas",
    url = "https://aclanthology.org/2021.mtsummit-research.6",
    pages = "61--75",
    abstract = "Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with a especially curated orthography for Yoruba{--}English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yoruba, in the training data. We investigate how and when this training condition affects the final quality of a translation and its understandability. Our models outperform massively multilingual models such as Google ($+8.7$ BLEU) and Facebook M2M ($+9.1$) when translating to Yoruba, setting a high quality benchmark for future research.",
}



