Multilingual Dataset
Source: arXiv | Indexed: 2025-09-30
Download link:
http://phontron.com/data/ted_talks.tar.gz
Description:
This multilingual dataset covers 9 languages and 56 zero-resource translation directions. The languages are English, French, Czech, German, Finnish, Estonian, Romanian, Hindi, and Turkish. The training data for each language comes from the latest available year, while the monolingual English data is sampled from news crawls. Parallel data for each language pair is capped at 10 million samples. The dataset is intended for zero-resource neural machine translation.
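The count of 56 directions is consistent with treating English as the pivot language and counting every ordered pair among the remaining 8 languages (8 × 7 = 56). The description does not state this explicitly, so the pivot choice below is an assumption, and the language codes and the function name `zero_resource_directions` are illustrative labels rather than part of the dataset. A minimal sketch:

```python
from itertools import permutations

# The nine languages named in the description.
# The two-letter codes are our own labels, not identifiers from the dataset.
LANGUAGES = ["en", "fr", "cs", "de", "fi", "et", "ro", "hi", "tr"]

# Assumption: English is the pivot, so every pair that excludes it
# has no direct parallel data and counts as zero-resource.
PIVOT = "en"


def zero_resource_directions(languages, pivot):
    """Return all ordered (source, target) pairs that exclude the pivot."""
    non_pivot = [lang for lang in languages if lang != pivot]
    return list(permutations(non_pivot, 2))


directions = zero_resource_directions(LANGUAGES, PIVOT)
print(len(directions))  # 8 non-pivot languages * 7 targets each
```

Each pair is ordered, so French to German and German to French are counted as two separate directions.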
Provided by:
WMT benchmark, TED Talks



