A Parallel Corpus of Thesis and Dissertations Abstracts

Name: A Parallel Corpus of Thesis and Dissertations Abstracts
Creator: figshare
Published: 2025-06-01 05:29:12
License: No description available

DataCite Commons. Updated 2025-06-01; indexed 2024-07-27.

Download link:

https://figshare.com/articles/A_Parallel_Corpus_of_Thesis_and_Dissertations_Abstracts/5995519/2

Description:

NOTE FOR WMT PARTICIPANTS: An easier version for MT is available in Moses format (one sentence per line). The file names start with moses_like.

If you use this dataset, please cite the following work:

@inproceedings{soares2018parallel,
  title={A Parallel Corpus of Theses and Dissertations Abstracts},
  author={Soares, Felipe and Yamashita, Gabrielli Harumi and Anzanello, Michel Jose},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={345--352},
  year={2018},
  organization={Springer}
}

In Brazil, the governmental body responsible for overseeing and coordinating post-graduate programs, CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior), keeps records of all theses and dissertations presented in the country. Information regarding these documents can be accessed online in the Thesis and Dissertations Catalog (TDC), which contains abstracts in Portuguese and English along with additional metadata. This database is therefore a potential source of parallel corpora for Portuguese and English. In this article, we present the development of a parallel corpus from TDC, which CAPES makes available under its open data initiative. Approximately 240,000 documents were collected and aligned using the Hunalign algorithm. We demonstrate the capability of the corpus by training Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for both language directions and comparing them with Google Translate (GT). Both of our translation models achieved better BLEU scores than GT, with the NMT system being the most accurate. Sentence alignment was also manually evaluated, presenting an average of XX% correctly aligned sentences. Our parallel corpus is freely available in TMX format, with complementary information regarding document metadata.
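The description mentions two distribution formats: Moses-style files (one sentence per line, aligned by line number) and TMX. The following is a minimal sketch of how sentence pairs might be read from each, using only the Python standard library. The function names, the file paths passed in, and the language codes "en" and "pt" are assumptions made for illustration; the actual file names and language tags in the release should be checked before use.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_moses_pairs(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) pairs from two line-aligned text files.

    Both paths are hypothetical; the release only states that the
    Moses-format files start with "moses_like".
    """
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        for n, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {n}")
            yield s.rstrip("\n"), t.rstrip("\n")


def read_tmx_pairs(tmx_file, src_lang="en", tgt_lang="pt"):
    """Yield (source, target) pairs, one per <tu> element of a TMX file.

    Language codes are assumptions; only the first two letters of each
    tag are compared, so "pt-BR" matches "pt".
    """
    # iterparse streams the file, which matters for a large corpus.
    for _, elem in ET.iterparse(tmx_file, events=("end",)):
        if elem.tag != "tu":
            continue
        segs = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang[:2]] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]
        elem.clear()  # free memory held by the processed unit
```

Both functions are generators, so a caller can iterate over the corpus without loading it all into memory. `read_tmx_pairs` accepts either a path or a binary file object.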


Provider: figshare

Created: 2019-01-21
