
UdS-LSV/menyo20k_mt

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/UdS-LSV/menyo20k_mt
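Resource summary:
MENYO-20k is a multi-domain parallel corpus containing texts collected from news articles, TED talks, movie transcripts, radio transcripts, science and technology texts, and other web sources and professional translators. The dataset contains 20,100 parallel sentences, split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences. The supported task is translation, and the languages are English and Yoruba. The dataset was created to provide a standardized evaluation dataset for machine translation research on low-resource language pairs. Use is governed by the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, which prohibits commercial use.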

Provider:
UdS-LSV
Summary of original information

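Dataset overview

Dataset introduction

Name: MENYO-20k

Description: MENYO-20k is a multi-domain parallel corpus with texts obtained from news articles, TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and by professional translators. The dataset has 20,100 parallel sentences split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences (3,419 multi-domain, 1,714 news domain, and 1,500 TED talk transcript domain).

Languages: English (en) and Yoruba (yo)

License: CC BY-NC 4.0

Multilinguality: translation

Task category: translation

Dataset size: 10K<n<100K

Source datasets: original

Dataset structure

Data instances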

    {
      "translation": {
        "en": "Unit 1: What is Creative Commons?",
        "yo": "Ìdá 1: Kín ni Creative Commons?"
      }
    }

Data fields

  • translation:
    • en: the English sentence
    • yo: the Yoruba sentence
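
The record layout above can be illustrated with a minimal sketch that parses one record and pulls out a sentence pair. The record is the example from the data instance shown earlier; the helper name `to_pair` and its `src`/`tgt` parameters are hypothetical, introduced only for illustration.

```python
import json
from typing import Tuple

# Example record copied from the data instance; everything else here is illustrative.
RAW_RECORD = (
    '{"translation": {"en": "Unit 1: What is Creative Commons?", '
    '"yo": "Ìdá 1: Kín ni Creative Commons?"}}'
)


def to_pair(record: dict, src: str = "en", tgt: str = "yo") -> Tuple[str, str]:
    """Return (source, target) sentences from one record (hypothetical helper)."""
    translation = record["translation"]  # nested dict keyed by language code
    return translation[src], translation[tgt]


record = json.loads(RAW_RECORD)
en_sentence, yo_sentence = to_pair(record)
print(en_sentence)
print(yo_sentence)
```

Assuming the Hugging Face `datasets` library is installed and the hub is reachable, records of this shape would typically be obtained with `datasets.load_dataset("UdS-LSV/menyo20k_mt")` rather than parsed by hand; swapping `src` and `tgt` gives the Yoruba-to-English direction.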

Data splits

  • Train: 10,070 examples, 2,551,345 bytes
  • Validation: 3,397 examples, 870,011 bytes
  • Test: 6,633 examples, 1,905,432 bytes
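
As a quick consistency check on the figures above, the following sketch sums the split sizes and derives each split's share of the corpus and its average bytes per example. The numbers are taken from this page; the dictionary keys and the function name `split_summary` are assumptions made for the example.

```python
# (examples, bytes) per split, as listed in the data splits above.
SPLITS = {
    "train": (10070, 2551345),
    "validation": (3397, 870011),
    "test": (6633, 1905432),
}

# Test-set breakdown by domain, from the dataset description.
TEST_DOMAINS = {"multi-domain": 3419, "news": 1714, "ted": 1500}


def split_summary(splits):
    """Return each split's share of all examples and mean bytes per example."""
    total = sum(n for n, _ in splits.values())
    return {
        name: {"share": n / total, "bytes_per_example": size / n}
        for name, (n, size) in splits.items()
    }


total_examples = sum(n for n, _ in SPLITS.values())
summary = split_summary(SPLITS)

print(f"total examples: {total_examples}")
for name, stats in summary.items():
    print(f"{name}: {stats['share']:.1%} of corpus, "
          f"{stats['bytes_per_example']:.0f} bytes/example")
```

The three splits add up to the 20,100 sentences quoted in the description, and the three test domains add up to the 6,633 test sentences.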

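Dataset creation

Annotation creators

  • expert-generated
  • found

Language creators

  • found

Licensing information

The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

Citation information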

    @inproceedings{adelani-etal-2021-effect,
      title = "The Effect of Domain and Diacritics in {Y}oruba{--}{E}nglish Neural Machine Translation",
      author = "Adelani, David and Ruiter, Dana and Alabi, Jesujoba and Adebonojo, Damilola and Ayeni, Adesina and Adeyemi, Mofe and Awokoya, Ayodele Esther and Espa{\~n}a-Bonet, Cristina",
      booktitle = "Proceedings of the 18th Biennial Machine Translation Summit (Volume 1: Research Track)",
      month = aug,
      year = "2021",
      address = "Virtual",
      publisher = "Association for Machine Translation in the Americas",
      url = "https://aclanthology.org/2021.mtsummit-research.6",
      pages = "61--75",
      abstract = "Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with a especially curated orthography for Yoruba{--}English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yoruba, in the training data. We investigate how and when this training condition affects the final quality of a translation and its understandability. Our models outperform massively multilingual models such as Google ($+8.7$ BLEU) and Facebook M2M ($+9.1$) when translating to Yoruba, setting a high quality benchmark for future research.",
    }
