MENYO-20k: A Multi-domain English - Yorùbá Corpus for Machine Translation

Indexed from NIAID Data Ecosystem on 2026-03-12
Download link:
https://zenodo.org/record/4297447
Description:
MENYO-20k is a multi-domain parallel dataset with texts obtained from news articles, TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and professional translators. The dataset has 20,100 parallel sentences split into 10,070 training sentences, 3,397 development sentences, and 6,633 test sentences (3,419 multi-domain, 1,714 news domain, and 1,500 TED talks speech transcript domain). The dataset is open, but restricted to non-commercial use, because some of the data sources, such as TED talks and JW news, require permission for commercial use. Acknowledgement: This project was supported by the AI4D language dataset fellowship through K4All and Zindi Africa.
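As a quick sanity check on the split sizes quoted in the description, the following minimal sketch adds them up. The dictionary keys and the function name are illustrative assumptions, not names taken from the dataset files; only the numbers come from the description.

```python
# Split sizes as stated in the dataset description.
# Key names are hypothetical labels chosen for this sketch.
SPLITS = {"train": 10_070, "dev": 3_397, "test": 6_633}
TEST_DOMAINS = {"multi_domain": 3_419, "news": 1_714, "ted_talks": 1_500}


def summarize_counts(splits, test_domains):
    """Return (total sentence pairs, sum of the per-domain test counts)."""
    return sum(splits.values()), sum(test_domains.values())


total, test_total = summarize_counts(SPLITS, TEST_DOMAINS)

# The per-domain test counts should add up to the stated test split size,
# and the three splits should add up to the stated corpus size.
print(f"total sentence pairs: {total}")
print(f"test pairs by domain: {test_total}")
print(f"consistent: {test_total == SPLITS['test']}")
```

The same check can be rerun against the actual files after download by replacing the hard-coded numbers with line counts.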
Created:
2020-12-01