ELITR ECA Corpus

Name: ELITR ECA Corpus
Creator: School of Informatics, University of Edinburgh, Scotland
Published: 2021-09-15 23:03:27
License: No description available

arXiv; updated 2021-09-15; indexed 2024-06-21

Download link:

http://data.statmt.org/elitr-eca


Resource description:

The ELITR ECA Corpus is a multilingual dataset created by the School of Informatics at the University of Edinburgh from publications of the European Court of Auditors. It contains 264,000 document pairs and 41.9 million sentence pairs covering 506 translation directions. To build it, the research team downloaded PDF reports from the European Court of Auditors website, extracted the plain text, machine-translated all of it with a multilingual neural machine translation system, and then used the Bleualign tool to identify parallel sentences. The dataset is intended mainly for machine translation research and aims to ease the scarcity of multilingual translation data.
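As a quick sanity check on the figure of 506 translation directions: it matches the number of ordered source-target pairs among 23 languages (23 × 22 = 506). The language count is an inference from the arithmetic, not something stated in this description, and the language codes below are hypothetical placeholders. A minimal sketch:

```python
from itertools import permutations


def count_directions(languages):
    """Count ordered (source, target) pairs where source != target."""
    return sum(1 for _ in permutations(languages, 2))


# Assumption: 23 languages. These codes are placeholders, not the
# corpus's actual language list.
langs = [f"lang{i:02d}" for i in range(23)]
n_directions = count_directions(langs)
print(n_directions)
```

Counting ordered pairs treats each language pair as two separate directions, which is the usual convention when reporting machine translation directions.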

Provider:

School of Informatics, University of Edinburgh, Scotland

Created:

2021-09-15
