A Parallel Corpus of Thesis and Dissertations Abstracts

Figshare · Updated 2019-01-21 · Indexed 2026-04-08

Download link:

https://figshare.com/articles/A_Parallel_Corpus_of_Thesis_and_Dissertations_Abstracts/5995519/2

Description:

NOTE FOR WMT PARTICIPANTS: An easier-to-use version for MT is available in Moses format (one sentence per line). The file names start with moses_like.

If you use this dataset, please cite the following work:

@inproceedings{soares2018parallel,
  title={A Parallel Corpus of Theses and Dissertations Abstracts},
  author={Soares, Felipe and Yamashita, Gabrielli Harumi and Anzanello, Michel Jose},
  booktitle={International Conference on Computational Processing of the Portuguese Language},
  pages={345--352},
  year={2018},
  organization={Springer}
}

In Brazil, the governmental body responsible for overseeing and coordinating postgraduate programs, CAPES, keeps records of all theses and dissertations presented in the country. Information regarding these documents can be accessed online in the Theses and Dissertations Catalog (TDC), which contains abstracts in Portuguese and English along with additional data about the documents. This database is therefore a potential source of parallel corpora for the Portuguese and English languages. In this article, we present the development of a parallel corpus from TDC, which is made available by CAPES under the open data initiative. Approximately 240,000 documents were collected and aligned using the Hunalign algorithm. We demonstrate the capability of the corpus by training Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) models for both language directions, followed by a comparison with Google Translate (GT). Both of our translation models achieved better BLEU scores than GT, with the NMT system being the most accurate. Sentence alignment was also manually evaluated, presenting an average of XX% correctly aligned sentences. Our parallel corpus is freely available in TMX format, with complementary information regarding document metadata.
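As a rough illustration of the "one sentence per line" Moses layout mentioned in the note above, the sketch below reads two line-aligned files and yields sentence pairs. It assumes the Portuguese and English sides are stored in two separate UTF-8 text files with the same number of lines; the function name `read_moses_pairs` and the file names in the usage note are hypothetical and are not taken from the dataset itself.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_moses_pairs(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: line i of the source file is the translation of line i
    of the target file, as is conventional for Moses-style corpora.
    """
    with Path(src_path).open(encoding="utf-8") as src, \
            Path(tgt_path).open(encoding="utf-8") as tgt:
        # zip_longest pads the shorter file with None, which lets us
        # detect a length mismatch instead of silently truncating.
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")
```

With hypothetical file names, usage might look like `pairs = list(read_moses_pairs("moses_like.pt", "moses_like.en"))`; the actual names in the download should be checked, since only the `moses_like` prefix is stated in the description.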


Created:

2019-01-21
