WMT corpora

Indexed from arXiv on 2025-09-30

Download link:

http://statmt.org/wmt15/translation-task.html

Description:

This dataset consists of two in-domain subsets selected from out-of-domain corpora. The subsets contain approximately 900,000 and 1.1 million sentences respectively, both filtered from an original pool of 3.1 million sentences. Sentences are selected by their similarity to the target in-domain data, which helps improve translation performance. Both subsets are used to fine-tune neural machine translation models.
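The description does not say which similarity measure was used for the filtering. As a rough illustration only, the sketch below ranks sentences from an out-of-domain pool by bag-of-words cosine similarity to a small in-domain sample and keeps the top few. The cosine measure, the function names (`bag_of_words`, `cosine`, `select_in_domain`), and the toy sentences are all assumptions made for this example, not details taken from the dataset.

```python
import math
from collections import Counter


def bag_of_words(sentence):
    """Lowercased whitespace-token counts for one sentence."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_domain(pool, in_domain, keep):
    """Return the `keep` pool sentences most similar to the in-domain profile.

    Assumption: similarity is cosine over word counts; the real
    selection criterion is not stated in the description above.
    """
    # Build one aggregate word-count profile for the in-domain sample.
    profile = Counter()
    for sentence in in_domain:
        profile.update(bag_of_words(sentence))
    # Rank the pool from most to least similar to that profile.
    ranked = sorted(
        pool,
        key=lambda s: cosine(bag_of_words(s), profile),
        reverse=True,
    )
    return ranked[:keep]


# Toy data, invented for illustration.
in_domain = [
    "patient received daily drug dose",
    "drug dose reduced patient symptoms",
]
pool = [
    "parliament voted on budget proposal",
    "patient took drug twice daily",
    "football fans celebrated in streets",
    "symptoms improved after higher dose",
]
selected = select_in_domain(pool, in_domain, keep=2)
print(selected)
```

On real corpora of millions of sentences, a selection step like this would normally work on tokenized text and use a stronger scoring model; the sketch only shows the rank-and-truncate shape of the procedure.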

Provider:

WMT
