MTet
Source: arXiv. Indexed: 2025-09-30
Download link:
https://github.com/stefan-it/nmt-en-vi
Overview:
MTet is currently the largest publicly available English-Vietnamese parallel corpus. It contains 4.2 million high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Compared with other existing data sources, it is larger and covers a broader range of domains, including technical, high-impact fields such as law and biomedicine. Its primary task is machine translation.
Provided by:
Vietnamese research community



