qanastek/WMT-16-PubMed

Source: Hugging Face | Updated: 2022-10-22 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/qanastek/WMT-16-PubMed
Resource description:
---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- found
language:
- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- hu
- it
- lt
- lv
- mt
- nl
- pl
- pt
- ro
- sk
- sl
- sv
multilinguality:
- multilingual
pretty_name: WMT-16-PubMed
size_categories:
- 100K<n<1M
source_datasets:
- extended
task_categories:
- translation
- machine-translation
task_ids:
- translation
- machine-translation
---

# WMT-16-PubMed : European parallel translation corpus from the European Medicines Agency

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
    - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
    - [Who are the source language producers?](#who-are-the-source-language-producers)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)

## Dataset Description

- **Homepage:** https://www.statmt.org/wmt16/biomedical-translation-task.html
- **Repository:** https://github.com/biomedical-translation-corpora/corpora
- **Paper:** https://aclanthology.org/W16-2301/
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Yanis Labrak](mailto:yanis.labrak@univ-avignon.fr)

### Dataset Summary

`WMT-16-PubMed` is a parallel corpus for neural machine translation, collected and aligned for ACL 2016 during the [WMT'16 Shared Task: Biomedical Translation Task](https://www.statmt.org/wmt16/biomedical-translation-task.html).

### Supported Tasks and Leaderboards

`translation`: The dataset can be used to train a model for translation.

### Languages

The corpus consists of source and target sentence pairs covering 4 languages:

**List of languages:** `English (en)`, `Spanish (es)`, `French (fr)`, `Portuguese (pt)`.

## Load the dataset with HuggingFace

```python
from datasets import load_dataset

# Download the single 'train' split, ignoring any locally cached copy.
dataset = load_dataset("qanastek/WMT-16-PubMed", split='train', download_mode='force_redownload')

print(dataset)     # summary of the columns and the number of rows
print(dataset[0])  # first sentence pair
```

## Dataset Structure

### Data Instances

```plain
        lang   doc_id    workshop                                     publisher  source_text                                        target_text
0       en-fr  26839447  WMT'16 Biomedical Translation Task - PubMed  pubmed     Global Health: Where Do Physiotherapy and Reha...  La place des cheveux et des poils dans les rit...
1       en-fr  26837117  WMT'16 Biomedical Translation Task - PubMed  pubmed     Carabin                                            Les Carabins
2       en-fr  26837116  WMT'16 Biomedical Translation Task - PubMed  pubmed     In Process Citation                                Le laboratoire d'Anatomie, Biomécanique et Org...
3       en-fr  26837115  WMT'16 Biomedical Translation Task - PubMed  pubmed     Comment on the misappropriation of bibliograph...  Du détournement des références bibliographique...
4       en-fr  26837114  WMT'16 Biomedical Translation Task - PubMed  pubmed     Anti-aging medicine, a science-based, essentia...  La médecine anti-âge, une médecine scientifiqu...
...     ...    ...       ...                                          ...        ...                                                ...
973972  en-pt  20274330  WMT'16 Biomedical Translation Task - PubMed  pubmed     Myocardial infarction, diagnosis and treatment     Infarto do miocárdio; diagnóstico e tratamento
973973  en-pt  20274329  WMT'16 Biomedical Translation Task - PubMed  pubmed     The health areas politics                          A política dos campos de saúde
973974  en-pt  20274328  WMT'16 Biomedical Translation Task - PubMed  pubmed     The role in tissue edema and liquid exchanges ...  O papel dos tecidos nos edemas e nas trocas lí...
973975  en-pt  20274327  WMT'16 Biomedical Translation Task - PubMed  pubmed     About suppuration of the wound after thoracopl...  Sôbre as supurações da ferida operatória após ...
973976  en-pt  20274326  WMT'16 Biomedical Translation Task - PubMed  pubmed     Experimental study of liver lesions in the tre...  Estudo experimental das lesões hepáticas no tr...
```

### Data Fields

**lang** : The source and target language pair, of type `String`.

**source_text** : The source text, of type `String`.

**target_text** : The target text, of type `String`.

### Data Splits

- `en-es` : 285,584
- `en-fr` : 614,093
- `en-pt` : 74,300

## Dataset Creation

### Curation Rationale

For details, check the corresponding [pages](https://www.statmt.org/wmt16/biomedical-translation-task.html).

### Source Data

<!-- #### Initial Data Collection and Normalization ddd -->

#### Who are the source language producers?

The shared task has been organized by:

* Antonio Jimeno Yepes (IBM Research Australia)
* Aurélie Névéol (LIMSI, CNRS, France)
* Mariana Neves (Hasso-Plattner Institute, Germany)
* Karin Verspoor (University of Melbourne, Australia)

### Personal and Sensitive Information

The corpus is free of personal or sensitive information.

## Considerations for Using the Data

### Other Known Limitations

The nature of the task introduces variability in the quality of the target translations.
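The `lang`, `source_text` and `target_text` fields described under Data Fields are enough to select a single language pair from the mixed corpus. The sketch below is illustrative only: it works on plain Python dictionaries shaped like the rows shown under Data Instances rather than on the downloaded dataset, the helper names `split_pair`, `filter_pair` and `count_pairs` are hypothetical, and it assumes that every `lang` value has the form `source-target` (as in `en-fr`).

```python
from collections import Counter


def split_pair(lang):
    """Split a pair label such as 'en-fr' into (source, target) codes.

    Assumes the label is two language codes joined by a single hyphen.
    """
    source, target = lang.split("-", 1)
    return source, target


def filter_pair(records, pair):
    """Keep only the records whose `lang` field equals `pair`."""
    return [record for record in records if record["lang"] == pair]


def count_pairs(records):
    """Count how many records belong to each language pair."""
    return Counter(record["lang"] for record in records)


# Three rows copied from the Data Instances listing, reduced to the
# documented fields.
sample = [
    {"lang": "en-fr", "source_text": "Carabin",
     "target_text": "Les Carabins"},
    {"lang": "en-pt", "source_text": "The health areas politics",
     "target_text": "A política dos campos de saúde"},
    {"lang": "en-pt",
     "source_text": "Myocardial infarction, diagnosis and treatment",
     "target_text": "Infarto do miocárdio; diagnóstico e tratamento"},
]

if __name__ == "__main__":
    print(split_pair("en-fr"))
    print(count_pairs(sample))
    for record in filter_pair(sample, "en-pt"):
        print(record["source_text"], "->", record["target_text"])
```

The same predicate (`record["lang"] == pair`) could presumably be passed to the `filter` method of a loaded `datasets.Dataset`, but that has not been checked against this corpus.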
## Additional Information

### Dataset Curators

__Hugging Face WMT-16-PubMed__: Labrak Yanis, Dufour Richard (Not affiliated with the original corpus)

__WMT'16 Shared Task: Biomedical Translation Task__:

* Antonio Jimeno Yepes (IBM Research Australia)
* Aurélie Névéol (LIMSI, CNRS, France)
* Mariana Neves (Hasso-Plattner Institute, Germany)
* Karin Verspoor (University of Melbourne, Australia)

<!-- ### Licensing Information ddd -->

### Citation Information

Please cite the following paper when using this dataset.

```latex
@inproceedings{bojar-etal-2016-findings,
    title = {Findings of the 2016 Conference on Machine Translation},
    author = {
        Bojar, Ondrej and
        Chatterjee, Rajen and
        Federmann, Christian and
        Graham, Yvette and
        Haddow, Barry and
        Huck, Matthias and
        Jimeno Yepes, Antonio and
        Koehn, Philipp and
        Logacheva, Varvara and
        Monz, Christof and
        Negri, Matteo and
        Neveol, Aurelie and
        Neves, Mariana and
        Popel, Martin and
        Post, Matt and
        Rubino, Raphael and
        Scarton, Carolina and
        Specia, Lucia and
        Turchi, Marco and
        Verspoor, Karin and
        Zampieri, Marcos
    },
    booktitle = {Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers},
    month = aug,
    year = {2016},
    address = {Berlin, Germany},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/W16-2301},
    doi = {10.18653/v1/W16-2301},
    pages = {131--198},
}
```

Provider:
qanastek
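Summary of the original information

Dataset overview

Dataset name

WMT-16-PubMed

Dataset description

WMT-16-PubMed is a parallel corpus for neural machine translation, collected and aligned for ACL 2016 as part of the WMT'16 Shared Task: Biomedical Translation Task.

Supported tasks

  • Translation: the dataset can be used to train a translation model.

Languages

  • English (en)
  • Spanish (es)
  • French (fr)
  • Portuguese (pt)

Dataset structure

  • Data instances: each instance contains a text in the source language and its counterpart in the target language.
  • Data fields
    • lang: the source and target language pair, of type String
    • source_text: the source text, of type String
    • target_text: the target text, of type String
  • Data splits
    • en-es : 285,584
    • en-fr : 614,093
    • en-pt : 74,300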
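
The three pair counts above add up to 973,977 sentence pairs, which matches the last row index (973976, counting from 0) in the data instance listing. A minimal sketch of that arithmetic, together with the share of each pair, follows; the names `SPLIT_SIZES`, `total_pairs` and `pair_shares` are illustrative and are not part of the dataset.

```python
# Pair counts copied from the data splits listed above.
SPLIT_SIZES = {"en-es": 285_584, "en-fr": 614_093, "en-pt": 74_300}


def total_pairs(sizes):
    """Return the total number of sentence pairs across all language pairs."""
    return sum(sizes.values())


def pair_shares(sizes):
    """Return each language pair's fraction of the whole corpus."""
    total = total_pairs(sizes)
    return {pair: count / total for pair, count in sizes.items()}


if __name__ == "__main__":
    print(total_pairs(SPLIT_SIZES))
    for pair, share in pair_shares(SPLIT_SIZES).items():
        print(f"{pair}: {share:.1%}")
```

Roughly 63% of the pairs are en-fr, 29% are en-es and 8% are en-pt, so a model trained on the full corpus without rebalancing sees far more French than Portuguese.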

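Dataset creation

  • Source data
    • Source language producers (shared task organizers)
      • Antonio Jimeno Yepes (IBM Research Australia)
      • Aurélie Névéol (LIMSI, CNRS, France)
      • Mariana Neves (Hasso-Plattner Institute, Germany)
      • Karin Verspoor (University of Melbourne, Australia)
  • Personal and sensitive information: the corpus is free of personal or sensitive information.

Considerations for using the data

  • Known limitations: the nature of the task introduces variability in the quality of the target translations.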


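Collected summary
Dataset introduction
Background and challenges
Background overview
WMT-16-PubMed is a parallel translation corpus for the biomedical domain. It contains sentence pairs between English and Spanish, French, and Portuguese, and is suited to neural machine translation tasks. The dataset was collected and aligned for the WMT'16 shared task at ACL 2016.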