JParaCrawl v3.0
Source: arXiv | Updated: 2022-02-28 | Indexed: 2024-06-21
Download link:
http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/
Description:
JParaCrawl v3.0 is a large-scale English-Japanese parallel corpus created by NTT Communication Science Laboratories, containing over 21 million unique parallel sentence pairs. The corpus was collected from the web by crawling and then processed with automatic sentence alignment. It was developed to address the scarcity of English-Japanese parallel data, particularly for machine translation. It is intended for use across domains including, but not limited to, scientific papers, news, and dialogue, where it is reported to improve the accuracy of machine translation models.
Provider:
NTT Communication Science Laboratories
Created:
2022-02-25



