NAIST-SIC-Aligned
Source: arXiv | Updated: 2024-04-01 | Indexed: 2024-06-21
Download link:
https://github.com/mingzi151/AHC-SI
Description:
NAIST-SIC-Aligned is a large-scale English-Japanese simultaneous translation corpus created jointly by the Nara Institute of Science and Technology and Monash University. It contains 67,079 automatically aligned parallel sentence pairs, drawn mainly from the real-time output of professional simultaneous interpreters and covering topics that range from technology to entertainment. The corpus was built with a two-stage alignment method, coarse alignment followed by fine-grained alignment, to ensure data quality. It is intended mainly for training and evaluating simultaneous machine translation systems, with the aim of improving translation quality and reducing latency.
Provider:
Nara Institute of Science and Technology
Created:
2023-04-24



