MTet

Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/stefan-it/nmt-en-vi
Description:
MTet is currently the largest publicly available English-Vietnamese parallel corpus. It contains 4.2 million high-quality training sentence pairs and a multi-domain test set refined by the Vietnamese research community. Compared with existing data sources, it is larger and covers a broader range of domains, including technical, high-impact fields such as law and biomedicine. Its primary task is machine translation.
Provider:
Vietnamese research community