Japanese-English Coursera Dataset

arXiv | Updated 2020-01-14 | Indexed 2024-08-06
Download link:
http://arxiv.org/abs/1912.11739v2
Resource overview:
The Japanese-English Coursera Dataset was created jointly by Kyoto University and the National Institute of Information and Communications Technology (NICT) for the translation of educational lectures. It contains approximately 40,000 lines of Japanese-English parallel data. The data are extracted automatically, and manual filtering is applied to ensure high-quality development and test sets. To build the dataset, multilingual, document-aligned subtitles are extracted from Coursera courses, and sentences are then aligned using machine translation and the cosine similarity of sentence vector representations. The dataset is intended mainly for evaluating and improving Japanese-English machine translation of educational lectures, with the aim of reducing language barriers to the global sharing of educational resources.
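
The alignment step described above (machine-translate one side, then compare sentence vectors by cosine similarity) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the dataset authors' pipeline: the bag-of-words vectors stand in for whatever sentence representations were actually used, the greedy best-match rule and the 0.5 threshold are arbitrary, and the names `embed`, `cosine`, and `align` are hypothetical. The Japanese side is assumed to have been machine-translated into English already.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy sentence vector: lowercase bag-of-words counts.

    Assumption: a stand-in for a real sentence embedding model.
    """
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(translated_src, tgt, threshold=0.5):
    """Greedily pair each translated source sentence with its most similar target.

    Returns a list of (src_index, tgt_index, score) for pairs whose score
    reaches the (assumed) threshold; weaker matches are dropped.
    """
    tgt_vecs = [embed(t) for t in tgt]
    pairs = []
    for i, sentence in enumerate(translated_src):
        if not tgt_vecs:
            break
        s_vec = embed(sentence)
        scores = [cosine(s_vec, t_vec) for t_vec in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


if __name__ == "__main__":
    # Hypothetical machine-translated Japanese subtitles and English subtitles.
    translated = [
        "welcome to this course on machine learning",
        "today we talk about linear regression",
    ]
    english = [
        "Welcome to the course on machine learning",
        "We talk about linear regression today",
        "Thank you",
    ]
    for i, j, score in align(translated, english):
        print(i, j, round(score, 3))
```

A greedy rule like this can map several source sentences to the same target sentence, so a real pipeline would likely add a one-to-one constraint or a monotonic ordering constraint on top of the similarity scores.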
Provider:
National Institute of Information and Communications Technology
Created:
2019-12-26