qanastek/EMEA-V3
收藏Hugging Face2022-10-22 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/qanastek/EMEA-V3
下载链接
链接失效反馈官方服务:
资源简介:
---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- found
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
multilinguality:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
pretty_name: EMEA-V3
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- translation
- machine-translation
task_ids:
- translation
- machine-translation
---
# EMEA-V3 : European parallel translation corpus from the European Medicines Agency
## Table of Contents
- [Dataset Card for [Needs More Information]](#dataset-card-for-needs-more-information)
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
- [Who are the source language producers?](#who-are-the-source-language-producers)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://opus.nlpl.eu/EMEA.php
- **Repository:** https://github.com/qanastek/EMEA-V3/
- **Paper:** https://aclanthology.org/L12-1246/
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Yanis Labrak](mailto:yanis.labrak@univ-avignon.fr)
### Dataset Summary
`EMEA-V3` is a parallel corpus for neural machine translation collected and aligned by [Tiedemann, Jorg](mailto:jorg.tiedemann@lingfil.uu.se) during the [OPUS project](https://opus.nlpl.eu/).
### Supported Tasks and Leaderboards
`translation`: The dataset can be used to train a model for translation.
### Languages
In our case, the corpora consists of a pair of source and target sentences for all 22 different languages from the European Union (EU).
**List of languages :** `Bulgarian (bg)`,`Czech (cs)`,`Danish (da)`,`German (de)`,`Greek (el)`,`English (en)`,`Spanish (es)`,`Estonian (et)`,`Finnish (fi)`,`French (fr)`,`Hungarian (hu)`,`Italian (it)`,`Lithuanian (lt)`,`Latvian (lv)`,`Maltese (mt)`,`Dutch (nl)`,`Polish (pl)`,`Portuguese (pt)`,`Romanian (ro)`,`Slovak (sk)`,`Slovenian (sl)`,`Swedish (sv)`.
## Load the dataset with HuggingFace
```python
from datasets import load_dataset
dataset = load_dataset("qanastek/EMEA-V3", split='train', download_mode='force_redownload')
print(dataset)
print(dataset[0])
```
## Dataset Structure
### Data Instances
```plain
lang,source_text,target_text
bg-cs,EMEA/ H/ C/ 471,EMEA/ H/ C/ 471
bg-cs,ABILIFY,ABILIFY
bg-cs,Какво представлява Abilify?,Co je Abilify?
bg-cs,"Abilify е лекарство, съдържащо активното вещество арипипразол.","Abilify je léčivý přípravek, který obsahuje účinnou látku aripiprazol."
bg-cs,"Предлага се под формата на таблетки от 5 mg, 10 mg, 15 mg и 30 mg, като диспергиращи се таблетки (таблетки, които се разтварят в устата) от 10 mg, 15 mg и 30 mg, като перорален разтвор (1 mg/ ml) и като инжекционен разтвор (7, 5 mg/ ml).","Je dostupný ve formě tablet s obsahem 5 mg, 10 mg, 15 mg a 30 mg, ve formě tablet dispergovatelných v ústech (tablet, které se rozpustí v ústech) s obsahem 10 mg, 15 mg a 30 mg, jako perorální roztok (1 mg/ ml) nebo jako injekční roztok (7, 5 mg/ ml)."
bg-cs,За какво се използва Abilify?,Na co se přípravek Abilify používá?
```
### Data Fields
**lang** : The pair of source and target language of type `String`.
**source_text** : The source text of type `String`.
**target_text** : The target text of type `String`.
### Data Splits
| | bg | cs | da | de | el | en | es | et | fi | fr | hu | it | lt | lv | mt | nl | pl | pt | ro | sk | sl | sv |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| **bg** | 0 | 342378 | 349675 | 348061 | 355696 | 333066 | 349936 | 336142 | 341732 | 358045 | 352763 | 351669 | 348679 | 342721 | 351097 | 353942 | 355005 | 347925 | 351099 | 345572 | 346954 | 342927 |
| **cs** | 342378 | 0 | 354824 | 353397 | 364609 | 335716 | 356506 | 340309 | 349040 | 363614 | 358353 | 357578 | 353232 | 347807 | 334353 | 355192 | 358357 | 351244 | 330447 | 346835 | 348411 | 346894 |
| **da** | 349675 | 354824 | 0 | 387202 | 397654 | 360186 | 387329 | 347391 | 379830 | 396294 | 367091 | 388495 | 360572 | 353801 | 342263 | 388250 | 368779 | 382576 | 340508 | 356890 | 357694 | 373510 |
| **de** | 348061 | 353397 | 387202 | 0 | 390281 | 364005 | 386335 | 346166 | 378626 | 393468 | 366828 | 381396 | 360907 | 353151 | 340294 | 377770 | 367080 | 381365 | 337562 | 355805 | 358700 | 376925 |
| **el** | 355696 | 364609 | 397654 | 390281 | 0 | 372824 | 393051 | 354874 | 384889 | 403248 | 373706 | 391389 | 368576 | 360047 | 348221 | 396284 | 372486 | 387170 | 342655 | 364959 | 363778 | 384569 |
| **en** | 333066 | 335716 | 360186 | 364005 | 372824 | 0 | 366769 | 333667 | 357177 | 373152 | 349176 | 361089 | 339899 | 336306 | 324695 | 360418 | 348450 | 361393 | 321233 | 338649 | 338195 | 352587 |
| **es** | 349936 | 356506 | 387329 | 386335 | 393051 | 366769 | 0 | 348454 | 378158 | 394253 | 368203 | 378076 | 360645 | 354126 | 340297 | 381188 | 367091 | 376443 | 337302 | 358745 | 357961 | 379462 |
| **et** | 336142 | 340309 | 347391 | 346166 | 354874 | 333667 | 348454 | 0 | 341694 | 358012 | 352099 | 351747 | 345417 | 339042 | 337302 | 350911 | 354329 | 345856 | 325992 | 343950 | 342787 | 340761 |
| **fi** | 341732 | 349040 | 379830 | 378626 | 384889 | 357177 | 378158 | 341694 | 0 | 387478 | 358869 | 379862 | 352968 | 346820 | 334275 | 379729 | 358760 | 374737 | 331135 | 348559 | 348680 | 368528 |
| **fr** | 358045 | 363614 | 396294 | 393468 | 403248 | 373152 | 394253 | 358012 | 387478 | 0 | 373625 | 385869 | 368817 | 361137 | 347699 | 388607 | 372387 | 388658 | 344139 | 363249 | 366474 | 383274 |
| **hu** | 352763 | 358353 | 367091 | 366828 | 373706 | 349176 | 368203 | 352099 | 358869 | 373625 | 0 | 367937 | 361015 | 354872 | 343831 | 368387 | 369040 | 361652 | 340410 | 357466 | 361157 | 356426 |
| **it** | 351669 | 357578 | 388495 | 381396 | 391389 | 361089 | 378076 | 351747 | 379862 | 385869 | 367937 | 0 | 360783 | 356001 | 341552 | 384018 | 365159 | 378841 | 337354 | 357562 | 358969 | 377635 |
| **lt** | 348679 | 353232 | 360572 | 360907 | 368576 | 339899 | 360645 | 345417 | 352968 | 368817 | 361015 | 360783 | 0 | 350576 | 337339 | 362096 | 361497 | 357070 | 335581 | 351639 | 350916 | 349636 |
| **lv** | 342721 | 347807 | 353801 | 353151 | 360047 | 336306 | 354126 | 339042 | 346820 | 361137 | 354872 | 356001 | 350576 | 0 | 336157 | 355791 | 358607 | 349590 | 329581 | 348689 | 346862 | 345016 |
| **mt** | 351097 | 334353 | 342263 | 340294 | 348221 | 324695 | 340297 | 337302 | 334275 | 347699 | 343831 | 341552 | 337339 | 336157 | 0 | 341111 | 344764 | 335553 | 338137 | 335930 | 334491 | 335353 |
| **nl** | 353942 | 355192 | 388250 | 377770 | 396284 | 360418 | 381188 | 350911 | 379729 | 388607 | 368387 | 384018 | 362096 | 355791 | 341111 | 0 | 369694 | 383913 | 339047 | 359126 | 360054 | 379771 |
| **pl** | 355005 | 358357 | 368779 | 367080 | 372486 | 348450 | 367091 | 354329 | 358760 | 372387 | 369040 | 365159 | 361497 | 358607 | 344764 | 369694 | 0 | 357426 | 335243 | 352527 | 355534 | 353214 |
| **pt** | 347925 | 351244 | 382576 | 381365 | 387170 | 361393 | 376443 | 345856 | 374737 | 388658 | 361652 | 378841 | 357070 | 349590 | 335553 | 383913 | 357426 | 0 | 333365 | 354784 | 352673 | 373392 |
| **ro** | 351099 | 330447 | 340508 | 337562 | 342655 | 321233 | 337302 | 325992 | 331135 | 344139 | 340410 | 337354 | 335581 | 329581 | 338137 | 339047 | 335243 | 333365 | 0 | 332373 | 330329 | 331268 |
| **sk** | 345572 | 346835 | 356890 | 355805 | 364959 | 338649 | 358745 | 343950 | 348559 | 363249 | 357466 | 357562 | 351639 | 348689 | 335930 | 359126 | 352527 | 354784 | 332373 | 0 | 348396 | 346855 |
| **sl** | 346954 | 348411 | 357694 | 358700 | 363778 | 338195 | 357961 | 342787 | 348680 | 366474 | 361157 | 358969 | 350916 | 346862 | 334491 | 360054 | 355534 | 352673 | 330329 | 348396 | 0 | 347727 |
| **sv** | 342927 | 346894 | 373510 | 376925 | 384569 | 352587 | 379462 | 340761 | 368528 | 383274 | 356426 | 377635 | 349636 | 345016 | 335353 | 379771 | 353214 | 373392 | 331268 | 346855 | 347727 | 0 |
## Dataset Creation
### Curation Rationale
For details, check the corresponding [pages](https://opus.nlpl.eu/EMEA.php).
### Source Data
<!-- #### Initial Data Collection and Normalization
ddd -->
#### Who are the source language producers?
Every data of this corpora as been uploaded by [Tiedemann, Jorg](mailto:jorg.tiedemann@lingfil.uu.se) on [Opus](https://opus.nlpl.eu/EMEA.php).
### Personal and Sensitive Information
The corpora is free of personal or sensitive information.
## Considerations for Using the Data
### Other Known Limitations
The nature of the task introduce a variability in the quality of the target translations.
## Additional Information
### Dataset Curators
__Hugging Face EMEA-V3__: Labrak Yanis, Dufour Richard (Not affiliated with the original corpus)
__OPUS : Parallel Data, Tools and Interfaces in OPUS__: [Tiedemann, Jorg](mailto:jorg.tiedemann@lingfil.uu.se).
<!-- ### Licensing Information
ddd -->
### Citation Information
Please cite the following paper when using this dataset.
```latex
@inproceedings{tiedemann-2012-parallel,
title = Parallel Data, Tools and Interfaces in OPUS,
author = {
Tiedemann, Jorg
},
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)",
month = may,
year = 2012,
address = Istanbul, Turkey,
publisher = European Language Resources Association (ELRA),
url = http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf,
pages = 2214--2218,
abstract = This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.,
}
```
提供机构:
qanastek
原始信息汇总
数据集概述
数据集名称
EMEA-V3
数据集描述
EMEA-V3 是一个用于神经机器翻译的平行语料库,由 Tiedemann, Jorg 在 OPUS 项目期间收集和校准。
支持的任务
- 翻译:该数据集可用于训练翻译模型。
语言
数据集包含来自欧盟的22种不同语言的源和目标句子对。
语言列表:
- Bulgarian (bg)
- Czech (cs)
- Danish (da)
- German (de)
- Greek (el)
- English (en)
- Spanish (es)
- Estonian (et)
- Finnish (fi)
- French (fr)
- Hungarian (hu)
- Italian (it)
- Lithuanian (lt)
- Latvian (lv)
- Maltese (mt)
- Dutch (nl)
- Polish (pl)
- Portuguese (pt)
- Romanian (ro)
- Slovak (sk)
- Slovenian (sl)
- Swedish (sv)
数据集结构
数据实例
数据集中的每个实例包含三种信息:
- lang:源和目标语言对,类型为
String。 - source_text:源文本,类型为
String。 - target_text:目标文本,类型为
String。
数据分割
数据集根据不同语言对进行了分割,具体分割情况请参考提供的表格。
数据集创建
源数据
数据集的所有数据由 Tiedemann, Jorg 上传至 Opus。
个人和敏感信息
数据集不包含个人或敏感信息。
使用数据时的考虑
数据集的翻译质量可能因任务的性质而有所不同。
搜集汇总
数据集介绍

背景与挑战
背景概述
EMEA-V3是一个由欧洲药品管理局提供的平行翻译语料库,包含22种欧盟语言的医疗领域文本,用于神经机器翻译任务。该数据集规模庞大,总行数超过8250万,覆盖多种语言对,如bg-cs、en-es等,专注于药物说明和医疗文档的翻译。
以上内容由遇见数据集搜集并总结生成



