Scielo Parallel Corpus
Source: arXiv. Updated: 2019-05-06. Indexed: 2024-08-06.
Download link:
http://arxiv.org/abs/1905.01852v1
Description:
The Scielo Parallel Corpus is a multilingual dataset of scientific articles created by the Institute of Informatics, Federal University of Rio Grande do Sul. It covers three languages: English, Portuguese, and Spanish. The articles were extracted from the Scielo database, span multiple scientific fields, and total approximately 32,756. Sentences were aligned automatically with the Hunalign algorithm, which the creators describe as giving high alignment accuracy. The dataset is intended mainly for training and evaluating statistical machine translation systems, in particular for cross-language translation and text analysis of scientific articles.
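As an illustration of how sentence-aligned data of this kind is commonly prepared before training a translation system, the sketch below pairs up lines from two aligned text streams and drops empty or badly mismatched pairs. It assumes a one-sentence-per-line layout in which line i of one stream corresponds to line i of the other; that layout, the function name `read_pairs`, and the `max_ratio` threshold are assumptions made for the example and are not taken from the dataset's documentation.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_pairs(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    max_ratio: float = 3.0,  # hypothetical threshold, not from the dataset
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Pairs are skipped when either side is empty or when the word-count
    ratio between the two sides exceeds max_ratio.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        yield src, tgt


# In-memory stand-ins for two aligned files (made-up sentences).
en = io.StringIO("The results were significant.\n\nWe thank the reviewers.\n")
pt = io.StringIO("Os resultados foram significativos.\n\nAgradecemos.\n")

pairs = list(read_pairs(en, pt))
for src, tgt in pairs:
    print(src, "|||", tgt)
```

With real data, the two `io.StringIO` objects would be replaced by open file handles for the two language sides. The length-ratio filter is a generic cleaning heuristic, shown here only as one plausible preprocessing step.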
Provided by:
Institute of Informatics, Federal University of Rio Grande do Sul
Created:
2019-05-06



