---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- found
language:
- en
- es
- fr
- pt
multilinguality:
- multilingual
pretty_name: WMT-16-PubMed
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- translation
- machine-translation
task_ids:
- translation
- machine-translation
---
# WMT-16-PubMed : Biomedical parallel translation corpus from PubMed
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Load the dataset with HuggingFace](#load-the-dataset-with-huggingface)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
    - [Who are the source language producers?](#who-are-the-source-language-producers)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Citation Information](#citation-information)
## Dataset Description
- **Homepage:** https://www.statmt.org/wmt16/biomedical-translation-task.html
- **Repository:** https://github.com/biomedical-translation-corpora/corpora
- **Paper:** https://aclanthology.org/W16-2301/
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Yanis Labrak](mailto:yanis.labrak@univ-avignon.fr)
### Dataset Summary
`WMT-16-PubMed` is a parallel corpus for neural machine translation, collected and aligned for the [WMT'16 Shared Task: Biomedical Translation Task](https://www.statmt.org/wmt16/biomedical-translation-task.html) held at ACL 2016.
### Supported Tasks and Leaderboards
`translation`: The dataset can be used to train a model for translation.
### Languages
The corpus consists of pairs of source and target sentences covering 4 languages:
**List of languages:** `English (en)`, `Spanish (es)`, `French (fr)`, `Portuguese (pt)`.
## Load the dataset with HuggingFace
```python
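from datasets import load_dataset

# Load the `train` split; `force_redownload` bypasses the local cache.
dataset = load_dataset("qanastek/WMT-16-PubMed", split='train', download_mode='force_redownload')

print(dataset)     # dataset summary: features and number of rows
print(dataset[0])  # first example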
```
## Dataset Structure
### Data Instances
```plain
lang doc_id workshop publisher source_text target_text
0 en-fr 26839447 WMT'16 Biomedical Translation Task - PubMed pubmed Global Health: Where Do Physiotherapy and Reha... La place des cheveux et des poils dans les rit...
1 en-fr 26837117 WMT'16 Biomedical Translation Task - PubMed pubmed Carabin Les Carabins
2 en-fr 26837116 WMT'16 Biomedical Translation Task - PubMed pubmed In Process Citation Le laboratoire d'Anatomie, Biomécanique et Org...
3 en-fr 26837115 WMT'16 Biomedical Translation Task - PubMed pubmed Comment on the misappropriation of bibliograph... Du détournement des références bibliographique...
4 en-fr 26837114 WMT'16 Biomedical Translation Task - PubMed pubmed Anti-aging medicine, a science-based, essentia... La médecine anti-âge, une médecine scientifiqu...
... ... ... ... ... ... ...
973972 en-pt 20274330 WMT'16 Biomedical Translation Task - PubMed pubmed Myocardial infarction, diagnosis and treatment Infarto do miocárdio; diagnóstico e tratamento
973973 en-pt 20274329 WMT'16 Biomedical Translation Task - PubMed pubmed The health areas politics A política dos campos de saúde
973974 en-pt 20274328 WMT'16 Biomedical Translation Task - PubMed pubmed The role in tissue edema and liquid exchanges ... O papel dos tecidos nos edemas e nas trocas lí...
973975 en-pt 20274327 WMT'16 Biomedical Translation Task - PubMed pubmed About suppuration of the wound after thoracopl... Sôbre as supurações da ferida operatória após ...
973976 en-pt 20274326 WMT'16 Biomedical Translation Task - PubMed pubmed Experimental study of liver lesions in the tre... Estudo experimental das lesões hepáticas no tr...
```
### Data Fields
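**lang** : The source and target language pair (e.g. `en-fr`), of type `String`.
**doc_id** : The identifier of the source document.
**workshop** : The name of the shared task the pair comes from.
**publisher** : The publisher of the source document (`pubmed` in the preview above).
**source_text** : The source text, of type `String`.
**target_text** : The target text, of type `String`.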
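
As a rough illustration of how these fields fit together, the sketch below splits the `lang` value into its source and target codes and reshapes a row into a nested `translation` dictionary keyed by language code. The helper name `to_translation_record` and the assumption that `lang` always has the form `<source>-<target>` are illustrative only and are not confirmed by the dataset authors; the sample rows are copied from the Data Instances preview.

```python
from typing import Dict, List

# Two rows copied from the Data Instances preview (only the fields used here).
SAMPLE_ROWS: List[Dict[str, str]] = [
    {"lang": "en-fr", "source_text": "Carabin", "target_text": "Les Carabins"},
    {
        "lang": "en-pt",
        "source_text": "The health areas politics",
        "target_text": "A política dos campos de saúde",
    },
]


def to_translation_record(row: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Reshape a row into {"translation": {<src code>: ..., <tgt code>: ...}}.

    Assumption (not confirmed by the card): `lang` is always "<source>-<target>".
    """
    source_code, target_code = row["lang"].split("-", 1)
    return {
        "translation": {
            source_code: row["source_text"],
            target_code: row["target_text"],
        }
    }


records = [to_translation_record(row) for row in SAMPLE_ROWS]
for record in records:
    print(record)
```

With the `datasets` library, a function of this shape could presumably be passed to `Dataset.map` to add the nested column to every row.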
### Data Splits
`en-es` : 285,584
`en-fr` : 614,093
`en-pt` : 74,300
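
The three counts above sum to 973,977 rows, which agrees with the last zero-based row index (973976) shown in the Data Instances preview. A minimal sketch of that check, plus a per-pair tally over in-memory rows, is given below; `count_by_pair` is a hypothetical helper and `toy_rows` is a made-up stand-in, not real data.

```python
from collections import Counter
from typing import Dict, Iterable

# Row counts per language pair, as listed in this card.
REPORTED_COUNTS: Dict[str, int] = {
    "en-es": 285_584,
    "en-fr": 614_093,
    "en-pt": 74_300,
}

total_rows = sum(REPORTED_COUNTS.values())
print(total_rows)  # 973977


def count_by_pair(rows: Iterable[Dict[str, str]]) -> Counter:
    """Tally rows by their `lang` value (hypothetical helper)."""
    return Counter(row["lang"] for row in rows)


# Minimal stand-in rows; a loaded dataset could be iterated the same way.
toy_rows = [{"lang": "en-fr"}, {"lang": "en-fr"}, {"lang": "en-pt"}]
pair_counts = count_by_pair(toy_rows)
print(pair_counts)
```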
## Dataset Creation
### Curation Rationale
For details, see the shared task [page](https://www.statmt.org/wmt16/biomedical-translation-task.html).
### Source Data
#### Who are the source language producers?
The shared task was organized by:
* Antonio Jimeno Yepes (IBM Research Australia)
* Aurélie Névéol (LIMSI, CNRS, France)
* Mariana Neves (Hasso-Plattner Institute, Germany)
* Karin Verspoor (University of Melbourne, Australia)
### Personal and Sensitive Information
The corpus is free of personal or sensitive information.
## Considerations for Using the Data
### Other Known Limitations
The nature of the task introduces variability in the quality of the target translations.
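
One common way to cope with uneven translation quality is to filter pairs with a simple heuristic before training. The sketch below drops pairs with an empty side or an extreme character-length ratio. The function name `keep_pair` and the threshold `max_ratio=3.0` are arbitrary assumptions for illustration; neither the shared task nor this dataset prescribes them.

```python
def keep_pair(source_text: str, target_text: str, max_ratio: float = 3.0) -> bool:
    """Return True if the pair passes a crude length-ratio check.

    `max_ratio` is an arbitrary illustrative threshold, not a recommended value.
    """
    source_len = len(source_text.strip())
    target_len = len(target_text.strip())
    if source_len == 0 or target_len == 0:
        return False  # one side is empty
    ratio = max(source_len, target_len) / min(source_len, target_len)
    return ratio <= max_ratio


# The first pair is copied from the Data Instances preview;
# the other two are made-up edge cases.
pairs = [
    ("Myocardial infarction, diagnosis and treatment",
     "Infarto do miocárdio; diagnóstico e tratamento"),
    ("Short", "Une traduction beaucoup trop longue pour cette source"),
    ("", "Texte sans source"),
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(len(kept))
```

A length ratio is only a coarse proxy: it cannot detect fluent but mismatched pairs, so it should be treated as a first pass rather than a quality guarantee.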
## Additional Information
### Dataset Curators
__Hugging Face WMT-16-PubMed__: Labrak Yanis, Dufour Richard (Not affiliated with the original corpus)
__WMT'16 Shared Task: Biomedical Translation Task__:
* Antonio Jimeno Yepes (IBM Research Australia)
* Aurélie Névéol (LIMSI, CNRS, France)
* Mariana Neves (Hasso-Plattner Institute, Germany)
* Karin Verspoor (University of Melbourne, Australia)
### Citation Information
Please cite the following paper when using this dataset.
```latex
@inproceedings{bojar-etal-2016-findings,
    title = "Findings of the 2016 Conference on Machine Translation",
    author = {
        Bojar, Ondrej and
        Chatterjee, Rajen and
        Federmann, Christian and
        Graham, Yvette and
        Haddow, Barry and
        Huck, Matthias and
        Jimeno Yepes, Antonio and
        Koehn, Philipp and
        Logacheva, Varvara and
        Monz, Christof and
        Negri, Matteo and
        Neveol, Aurelie and
        Neves, Mariana and
        Popel, Martin and
        Post, Matt and
        Rubino, Raphael and
        Scarton, Carolina and
        Specia, Lucia and
        Turchi, Marco and
        Verspoor, Karin and
        Zampieri, Marcos
    },
    booktitle = "Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers",
    month = aug,
    year = 2016,
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W16-2301",
    doi = "10.18653/v1/W16-2301",
    pages = "131--198",
}
```