A Parallel Corpus of Theses and Dissertations Abstracts

Source: arXiv. Updated: 2019-05-06. Indexed: 2024-06-21.
Download link:
https://dadosabertos.capes.gov.br/dataset/catalogo-de-teses-edissertacoes-de-2013-a-2016
Description:
This dataset, 'A Parallel Corpus of Theses and Dissertations Abstracts', was created jointly by the Institute of Computer Science and the School of Engineering of the Federal University of Brazil. It contains approximately 240,000 abstracts of Brazilian master's theses and doctoral dissertations from 2013 through 2016, each available as parallel Portuguese-English text. The two language versions were aligned automatically with the Hunalign tool to produce well-matched text pairs. The dataset is intended mainly to support research on statistical and neural machine translation, particularly the translation of scientific literature, with the aim of improving the accuracy of cross-lingual information retrieval and comprehension.
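
The description does not say how the aligned pairs are stored, so the following is only a minimal sketch of how Hunalign-style output could be read. It assumes tab-separated lines of the form source, target, alignment score (the layout Hunalign writes in its text output mode). The names `AlignedPair` and `parse_aligned_lines`, the `min_score` threshold, and the sample sentences are hypothetical and do not come from the dataset.

```python
from typing import Iterable, Iterator, NamedTuple


class AlignedPair(NamedTuple):
    """One Portuguese-English pair with its alignment score (hypothetical type)."""
    portuguese: str
    english: str
    score: float


def parse_aligned_lines(lines: Iterable[str], min_score: float = 0.0) -> Iterator[AlignedPair]:
    """Yield pairs from assumed 'pt<TAB>en<TAB>score' lines, keeping score >= min_score."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not have exactly three columns
        pt, en, raw_score = fields
        if not pt.strip() or not en.strip():
            continue  # skip one-sided alignments (a sentence with no counterpart)
        try:
            score = float(raw_score)
        except ValueError:
            continue  # skip lines whose score column is not numeric
        if score >= min_score:
            yield AlignedPair(pt.strip(), en.strip(), score)


# Made-up sample lines, for illustration only.
sample = [
    "Este trabalho apresenta um corpus paralelo.\tThis work presents a parallel corpus.\t1.25\n",
    "Resultados.\t\t0.10\n",
    "linha malformada\n",
    "Os resultados são promissores.\tThe results are promising.\t0.20\n",
]
pairs = list(parse_aligned_lines(sample, min_score=0.5))
```

Filtering on the alignment score is one common way to trade corpus size for pair quality before training a translation model; a suitable threshold would have to be chosen empirically for the actual files.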
Provider:
Institute of Computer Science, Federal University - School of Engineering, Federal University
Created:
2019-05-06