BSC-LT/NTEU_Multilingual_Evaluation_Dataset

Source: Hugging Face. Updated: 2025-11-04. Indexed: 2026-01-03.
Download link:
https://hf-mirror.com/datasets/BSC-LT/NTEU_Multilingual_Evaluation_Dataset
Description:

---
license: cc-by-4.0
---

# Dataset Card for NTEU Multilingual Evaluation Dataset

## Dataset Description

- **Point of Contact:** langtech@bsc.es

### Dataset Summary

This machine translation evaluation dataset was created by the [NTEU - Neural Translation for the EU](https://pangeanic.com/nteu) project. It includes around 1,000 parallel sentences in the 24 official languages of the European Union. The original NTEU dataset has been cleaned and filtered by removing empty lines and near-duplicates, and it has been augmented with Catalan. The Catalan version was manually produced by a native Catalan translator from the original English and Spanish versions, and was sponsored by the [AINA project](https://projecteaina.cat/).

### Supported Tasks and Leaderboards

This dataset can be used to evaluate bilingual and multilingual machine translation systems in the legal domain, for any combination of the 24 official EU languages and Catalan.

### Languages

The dataset includes the following languages:

| CODE | LANGUAGE   | SCRIPT   |
|------|------------|----------|
| bg   | Bulgarian  | Cyrillic |
| ca   | Catalan    | Latin    |
| cs   | Czech      | Latin    |
| da   | Danish     | Latin    |
| de   | German     | Latin    |
| el   | Greek      | Greek    |
| en   | English    | Latin    |
| es   | Spanish    | Latin    |
| et   | Estonian   | Latin    |
| fi   | Finnish    | Latin    |
| fr   | French     | Latin    |
| ga   | Irish      | Latin    |
| hr   | Croatian   | Latin    |
| hu   | Hungarian  | Latin    |
| it   | Italian    | Latin    |
| lt   | Lithuanian | Latin    |
| lv   | Latvian    | Latin    |
| mt   | Maltese    | Latin    |
| nl   | Dutch      | Latin    |
| pl   | Polish     | Latin    |
| pt   | Portuguese | Latin    |
| ro   | Romanian   | Latin    |
| sk   | Slovak     | Latin    |
| sl   | Slovenian  | Latin    |
| sv   | Swedish    | Latin    |

## Dataset Structure

### Data Instances

A separate .txt file is provided for each language, with sentences aligned in the same order across all files. Each file uses the two-letter code of its language as the file extension.
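Because the files are aligned line by line, a sentence pair for any two languages can be formed by reading the two files in parallel. The following is a minimal sketch rather than an official loader: the function names `read_sentences` and `load_parallel` are hypothetical, and it assumes UTF-8 encoding, one sentence per line, and a local directory that holds exactly one file per language extension (for example one file ending in `.en` and one ending in `.ca`). The card does not state the files' base names, so the sketch matches on the extension only.

```python
from pathlib import Path


def read_sentences(path):
    """Read one sentence per line, dropping only the trailing newline."""
    # Assumption: the files are UTF-8 encoded; the card does not say so.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(data_dir, src, tgt):
    """Return a list of (source, target) sentence pairs.

    `src` and `tgt` are two-letter language codes used as file
    extensions, e.g. "en" and "ca".
    """
    data_dir = Path(data_dir)
    sides = []
    for code in (src, tgt):
        # Match on the extension only, since the base name is not documented.
        matches = sorted(data_dir.glob(f"*.{code}"))
        if len(matches) != 1:
            raise FileNotFoundError(
                f"expected exactly one '*.{code}' file in {data_dir}, "
                f"found {len(matches)}"
            )
        sides.append(read_sentences(matches[0]))

    # Line-level alignment only holds if both files have the same length.
    if len(sides[0]) != len(sides[1]):
        raise ValueError(
            f"line counts differ: {len(sides[0])} ({src}) "
            f"vs {len(sides[1])} ({tgt})"
        )
    return list(zip(sides[0], sides[1]))
```

With the files downloaded into a local directory (here hypothetically named `nteu_eval`), `load_parallel("nteu_eval", "en", "ca")` would return the English-Catalan pairs in file order. The explicit length check is a design choice: a silent `zip` would truncate to the shorter file and hide a misalignment.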
### Data Fields

[N/A]

### Data Splits

The dataset contains a single split: `Test`.

## Dataset Creation

### Curation Rationale

The aim of this dataset is to promote the evaluation of machine translation systems for the official EU languages, plus Catalan.

### Source Data

#### Initial Data Collection and Normalization

The data was originally extracted from [EUR-Lex](https://eur-lex.europa.eu/homepage.html), the official online database of European Union law and other public documents of the European Union (EU), published in the 24 official languages of the EU. The Official Journal (OJ) of the European Union is also published on EUR-Lex.

#### Who are the source language producers?

[EUR-Lex](https://eur-lex.europa.eu/homepage.html)

### Annotations

#### Annotation process

The dataset does not contain any annotations.

#### Who are the annotators?

[N/A]

### Personal and Sensitive Information

No specific anonymisation process has been applied, so personal and sensitive information may be present in the data. This needs to be considered when using the data for training models.

## Considerations for Using the Data

### Social Impact of Dataset

By providing this resource, we intend to promote the evaluation of machine translation systems covering all the official EU languages and Catalan, thereby improving the accessibility and visibility of the Catalan language in Europe.

### Discussion of Biases

No specific bias mitigation strategies were applied to this dataset. Inherent biases may exist within the data.

### Other Known Limitations

The dataset contains data from the legal/administrative domain, so it is of limited use for applications in other domains.

## Additional Information

### Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es).

### Funding

This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
### Licensing Information

This work is licensed under an [Attribution 4.0 International](https://creativecommons.org/licenses/by/4.0/) licence.

### Citation Information

For more information about the [NTEU Project](https://pangeanic.com/nteu), please refer to the following paper:

```
@inproceedings{bie-etal-2020-neural,
    title = "Neural Translation for the {E}uropean {U}nion ({NTEU}) Project",
    author = "Bi{\'e}, Laurent and Cerd{\`a}-i-Cuc{\'o}, Aleix and Degroote, Hans and Estela, Amando and Garc{\'i}a-Mart{\'i}nez, Mercedes and Herranz, Manuel and Kohan, Alejandro and Melero, Maite and O{'}Dowd, Tony and O{'}Gorman, Sin{\'e}ad and Pinnis, M{\={a}}rcis and Rozis, Roberts and Superbo, Riccardo and Vasi{\c{l}}evskis, Art{\={u}}rs",
    editor = "Martins, Andr{\'e} and Moniz, Helena and Fumega, Sara and Martins, Bruno and Batista, Fernando and Coheur, Luisa and Parra, Carla and Trancoso, Isabel and Turchi, Marco and Bisazza, Arianna and Moorkens, Joss and Guerberof, Ana and Nurminen, Mary and Marg, Lena and Forcada, Mikel L.",
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.60/",
    pages = "477--478",
    abstract = "The Neural Translation for the European Union (NTEU) project aims to build a neural engine farm with all European official language combinations for eTranslation, without the necessity to use a high-resourced language as a pivot. NTEU started in September 2019 and will run until August 2021."
}
```

### Contributions

[N/A]
Provider:
BSC-LT