WMT corpora
Source: arXiv. Indexed 2025-09-30.
Download link:
http://statmt.org/wmt15/translation-task.html
Overview:
This dataset comprises two in-domain subsets selected from out-of-domain corpora. The subsets contain approximately 900,000 and 1.1 million sentences respectively, filtered from an original pool of 3.1 million sentences. Sentences were selected by their similarity to in-domain data, which helps improve translation performance. Both subsets are intended for fine-tuning neural machine translation models.
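The listing does not say which similarity measure was used for the selection. As a rough illustration only, the sketch below ranks pool sentences by bag-of-words cosine similarity to the centroid of a small in-domain sample and keeps the top `keep` of them. The function names (`bow`, `cosine`, `select_similar`), the whitespace tokenization, and the toy sentences are all assumptions made for this example, not the method behind this dataset.

```python
import math
from collections import Counter


def bow(sentence):
    """Lowercased whitespace-token counts (a deliberately simple representation)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar(pool, in_domain, keep):
    """Return the `keep` pool sentences most similar to the in-domain centroid.

    Hypothetical helper: the similarity measure is an assumption for illustration.
    """
    centroid = Counter()
    for sentence in in_domain:
        centroid.update(bow(sentence))
    ranked = sorted(pool, key=lambda s: cosine(bow(s), centroid), reverse=True)
    return ranked[:keep]


# Toy data, invented for this example.
in_domain = [
    "the patient received a dose of the drug",
    "the drug reduced symptoms in the patient",
]
pool = [
    "the patient was given the drug",
    "parliament voted on the budget",
    "symptoms improved after the dose",
    "the football match ended late",
]
selected = select_similar(pool, in_domain, keep=2)
print(selected)
```

In practice, selection at the scale described here (millions of sentences) would more likely rely on language-model scores or sentence embeddings than on raw token counts; the ranking-and-truncation structure stays the same.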
Provider:
WMT



