CausalMT
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/EdisonNi-hku/CausalMT
Description:
This dataset contains a large number of translation pairs whose translation direction has been labeled by human annotators, allowing researchers to study how train-test direction match and data-model direction match affect machine translation performance. It also supports analysis of how translated text differs from naturally written text in properties such as vocabulary size and redundancy. In terms of scale, it comprises three training sets with over 200,000 translation pairs each, two training sets with over 90,000 translation pairs each, a development set of 1,000 pairs, and a test set of 2,000 pairs. The dataset is intended for machine translation research.



