UdS-LSV/menyo20k_mt

Name: UdS-LSV/menyo20k_mt
Creator: UdS-LSV
Published: 2024-01-18 11:08:52
License: CC BY-NC 4.0

Source: Hugging Face. Updated: 2024-01-18. Indexed: 2024-06-15.

Download link:

https://hf-mirror.com/datasets/UdS-LSV/menyo20k_mt


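Overview:

MENYO-20k is a multi-domain parallel corpus containing text collected from news articles, TED talks, movie scripts, radio scripts, science and technology texts, and other web sources and professional translators. The dataset contains 20,100 parallel sentences, split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences. It supports the translation task for the English-Yoruba language pair. It was created to provide a standardized evaluation dataset for machine translation research on low-resource language pairs. Use of the dataset is governed by the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, which prohibits commercial use.

Provided by: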

UdS-LSV

Summary of Original Information

Dataset Overview

Dataset Summary

Name: MENYO-20k

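Description: MENYO-20k is a multi-domain parallel corpus with text obtained from news articles, TED talks, movie scripts, radio scripts, science and technology texts, and other short articles curated from the web and by professional translators. The dataset has 20,100 parallel sentences split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences (3,419 multi-domain, 1,714 news domain, and 1,500 TED talk transcript domain).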
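The counts in the description can be cross-checked with a few lines of arithmetic. The sketch below only restates the numbers quoted above; the dictionary keys and the function name `check_counts` are illustrative labels, not identifiers from the dataset.

```python
# Split sizes as stated in the dataset description.
SPLITS = {"train": 10_070, "dev": 3_397, "test": 6_633}

# Test-set breakdown by domain, as stated in the description.
TEST_DOMAINS = {"multi-domain": 3_419, "news": 1_714, "ted": 1_500}


def check_counts(splits, test_domains, total=20_100):
    """Return True if the splits sum to `total` and the
    per-domain test counts sum to the size of the test split."""
    splits_ok = sum(splits.values()) == total
    domains_ok = sum(test_domains.values()) == splits["test"]
    return splits_ok and domains_ok


print(check_counts(SPLITS, TEST_DOMAINS))  # True
```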

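Languages: English (en) and Yoruba (yo)

License: CC BY-NC 4.0

Multilinguality: translation

Task category: translation

Dataset size: 10K<n<100K

Source datasets: original

Dataset Structure

Data Instances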

{
  "translation": {
    "en": "Unit 1: What is Creative Commons?",
    "yo": "Ìdá 1: Kín ni Creative Commons?"
  }
}

Data Fields

translation:
- en: English sentence
- yo: Yoruba sentence
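As a minimal sketch of how a record with these fields might be consumed, the snippet below parses the example instance with the standard `json` module and extracts a source/target pair. The helper name `to_pair` and its `src`/`tgt` parameters are assumptions made for illustration.

```python
import json

# The example instance from the dataset card, as a JSON string.
RAW = (
    '{"translation": {"en": "Unit 1: What is Creative Commons?", '
    '"yo": "Ìdá 1: Kín ni Creative Commons?"}}'
)


def to_pair(record, src="en", tgt="yo"):
    """Return (source sentence, target sentence) from one record."""
    translation = record["translation"]
    return translation[src], translation[tgt]


record = json.loads(RAW)
en, yo = to_pair(record)
print(en)
print(yo)
```

Swapping the arguments, as in `to_pair(record, src="yo", tgt="en")`, yields the pair for the Yoruba-to-English direction.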

Data Splits

Training set: 10,070 examples, 2,551,345 bytes
Validation set: 3,397 examples, 870,011 bytes
Test set: 6,633 examples, 1,905,432 bytes
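One possible way to fetch the data and confirm these split sizes is sketched below, assuming the Hugging Face `datasets` library is installed and that the repository exposes splits named `train`, `validation`, and `test` (the split names are an assumption based on the listing above). `load_menyo` and `size_mismatches` are hypothetical helper names; the loader needs network access, while the size check works on any mapping of split name to a sized collection.

```python
# Expected number of examples per split, as listed above.
EXPECTED_SIZES = {"train": 10_070, "validation": 3_397, "test": 6_633}


def load_menyo():
    """Download all splits from the Hugging Face Hub.

    Requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # lazy import keeps the check usable offline

    return load_dataset("UdS-LSV/menyo20k_mt")


def size_mismatches(splits, expected=EXPECTED_SIZES):
    """Return {split: (found, expected)} for each split that is
    missing (found is None) or has an unexpected length."""
    problems = {}
    for name, want in expected.items():
        found = len(splits[name]) if name in splits else None
        if found != want:
            problems[name] = (found, want)
    return problems


if __name__ == "__main__":
    # An empty dict means every split has the documented size.
    print(size_mismatches(load_menyo()))
```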

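Dataset Creation

Annotation creators

expert-generated
found

Language creators

found

Licensing Information

The dataset is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

Citation Information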

@inproceedings{adelani-etal-2021-effect,
    title = "The Effect of Domain and Diacritics in {Y}oruba{--}{E}nglish Neural Machine Translation",
    author = "Adelani, David and Ruiter, Dana and Alabi, Jesujoba and Adebonojo, Damilola and Ayeni, Adesina and Adeyemi, Mofe and Awokoya, Ayodele Esther and Espa{\~n}a-Bonet, Cristina",
    booktitle = "Proceedings of the 18th Biennial Machine Translation Summit (Volume 1: Research Track)",
    month = aug,
    year = "2021",
    address = "Virtual",
    publisher = "Association for Machine Translation in the Americas",
    url = "https://aclanthology.org/2021.mtsummit-research.6",
    pages = "61--75",
    abstract = "Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with an especially curated orthography for Yoruba{--}English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yoruba, in the training data. We investigate how and when this training condition affects the final quality of a translation and its understandability. Our models outperform massively multilingual models such as Google ($+8.7$ BLEU) and Facebook M2M ($+9.1$) when translating to Yoruba, setting a high quality benchmark for future research."
}
