Coursera Parallel Corpus

Source: arXiv · Updated 2023-11-07 · Indexed 2024-06-21

Download link:

https://github.com/shyyhs/CourseraParallelCorpusMining

Description:

Coursera Parallel Corpus is a bilingual parallel dataset created jointly by the National Institute of Information and Communications Technology (NICT, Japan), Kyoto University, and the National Institute of Informatics (NII). Its goal is to improve machine translation of lecture transcripts. The corpus is mined from publicly available lectures on the Coursera platform. Sentences are aligned with a dynamic programming algorithm that scores candidate pairs by the cosine similarity between machine-translated sentences and sentences on the other side, which is intended to keep the aligned data high in quality. The dataset contains about 50,543 lines of English-Japanese parallel text and 40,074 lines of English-Chinese parallel text, and is suited to developing and evaluating English-Japanese and English-Chinese machine translation systems. Used in a multi-stage fine-tuning strategy, the corpus is reported to improve translation performance on educational lectures, helping with the low-resource translation setting.
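The alignment step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes whitespace-tokenized bag-of-words vectors in place of whatever sentence representation the corpus builders used, allows only monotonic one-to-one matches, and the names `cosine`, `align`, and `threshold` are hypothetical. It only shows how dynamic programming can select the set of sentence pairs with the highest total cosine similarity.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(src_mt, tgt, threshold=0.3):
    """Monotonic 1-to-1 alignment that maximizes total cosine similarity.

    src_mt: source sentences already machine-translated into the target language.
    tgt:    sentences in the target language.
    threshold: assumed cutoff; pairs scoring below it are never matched.
    Returns a list of (source_index, target_index) pairs.
    """
    a = [Counter(s.lower().split()) for s in src_mt]
    b = [Counter(s.lower().split()) for s in tgt]
    n, m = len(a), len(b)
    sim = [[cosine(x, y) for y in b] for x in a]

    # score[i][j] = best total similarity using the first i source
    # sentences and the first j target sentences.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sim[i - 1][j - 1]
            match = score[i - 1][j - 1] + (s if s >= threshold else 0.0)
            score[i][j] = max(match, score[i - 1][j], score[i][j - 1])

    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = sim[i - 1][j - 1]
        if s >= threshold and score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j]:
            i -= 1  # source sentence i-1 left unaligned
        else:
            j -= 1  # target sentence j-1 left unaligned
    pairs.reverse()
    return pairs


# Toy example: the second target line has no counterpart and is skipped.
src_mt = ["welcome to the course", "today we study neural networks", "thank you"]
tgt = ["welcome to this course", "music",
       "today we will study neural networks", "thank you very much"]
print(align(src_mt, tgt))
```

Because unmatched sentences on either side are skipped rather than forced into a pair, lines without a confident counterpart drop out of the result, which is one plausible way an alignment procedure of this kind can favor precision.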

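Provided by:

National Institute of Information and Communications Technology (NICT) / Kyoto, Japan; Kyoto University / Kyoto, Japan; National Institute of Informatics (NII) / Tokyo, Japan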

Created:

2023-11-07
