
europarl

ModelScope community · updated 2025-12-05 · indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/Helsinki-NLP/europarl
Resource description:
# Dataset Card for OPUS Europarl (European Parliament Proceedings Parallel Corpus)

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://opus.nlpl.eu/Europarl/corpus/version/Europarl
- **Homepage:** https://www.statmt.org/europarl/
- **Repository:** [OPUS Europarl](https://opus.nlpl.eu/Europarl.php)
- **Paper:** https://aclanthology.org/2005.mtsummit-papers.11/
- **Paper:** https://aclanthology.org/L12-1246/
- **Leaderboard:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

A parallel corpus extracted from the European Parliament web site by Philipp Koehn (University of Edinburgh). The main intended use is to aid statistical machine translation research.
More information can be found at http://www.statmt.org/europarl/

### Supported Tasks and Leaderboards

Tasks: Machine Translation, Cross-Lingual Word Embeddings (CLWE) Alignment

### Languages

- 21 languages, 211 bitexts
- total number of files: 207,775
- total number of tokens: 759.05M
- total number of sentence fragments: 30.32M

Every pair of the following languages is available:

- bg
- cs
- da
- de
- el
- en
- es
- et
- fi
- fr
- hu
- it
- lt
- lv
- nl
- pl
- pt
- ro
- sk
- sl
- sv

## Dataset Structure

### Data Instances

Here is an example from the en-fr pair:

```
{
  'translation': {
    'en': 'Resumption of the session',
    'fr': 'Reprise de la session'
  }
}
```

### Data Fields

- `translation`: a dictionary containing two strings, each keyed by its language code.

### Data Splits

- `train`: only a train split is provided. The authors did not separate the examples into `train`, `dev` and `test`.

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

The data set comes with the same license as the original sources. Please check the information about the source given at https://opus.nlpl.eu/Europarl/corpus/version/Europarl

The terms of use of the original source dataset are:

> We are not aware of any copyright restrictions of the material. If you use this data in your research, please contact phi@jhu.edu.

### Citation Information

Please cite the paper if you use this corpus in your work:

```
@inproceedings{koehn-2005-europarl,
    title = "{E}uroparl: A Parallel Corpus for Statistical Machine Translation",
    author = "Koehn, Philipp",
    booktitle = "Proceedings of Machine Translation Summit X: Papers",
    month = sep # " 13-15",
    year = "2005",
    address = "Phuket, Thailand",
    url = "https://aclanthology.org/2005.mtsummit-papers.11",
    pages = "79--86",
}
```

Please cite the following article if you use any part of the corpus in your own work:

```
@inproceedings{tiedemann-2012-parallel,
    title = "Parallel Data, Tools and Interfaces in {OPUS}",
    author = {Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Do{\u{g}}an, Mehmet U{\u{g}}ur and Maegaard, Bente and Mariani, Joseph and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
    pages = "2214--2218",
}
```

### Contributions

Thanks to [@lucadiliello](https://github.com/lucadiliello) for adding this dataset.
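
As a rough illustration of the record layout described under Data Instances and Data Fields, and of one way to cope with the single `train` split noted under Data Splits, the sketch below flattens `translation` dictionaries into sentence pairs and carves out dev and test portions with a seeded shuffle. The helper names `to_pairs` and `split_train_dev_test`, the 10% fractions, and every placeholder sentence other than the en-fr example from the card are assumptions made for illustration; they are not part of the corpus or of any library.

```python
import random

# Records follow the card's layout: one `translation` dict keyed by language code.
# Only the first record comes from the card; the rest are made-up placeholders.
records = [
    {"translation": {"en": "Resumption of the session", "fr": "Reprise de la session"}}
]
records += [
    {"translation": {"en": f"placeholder sentence {i}", "fr": f"phrase fictive {i}"}}
    for i in range(1, 10)
]


def to_pairs(records, src, tgt):
    """Flatten records into (source, target) string tuples, preserving order."""
    return [(r["translation"][src], r["translation"][tgt]) for r in records]


def split_train_dev_test(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy deterministically, then slice off dev and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


pairs = to_pairs(records, "en", "fr")
train, dev, test = split_train_dev_test(pairs)
print(len(train), len(dev), len(test))  # 8 1 1
```

A random sentence-level split is only one option; because neighbouring sentences in parliamentary proceedings tend to be related, holding out whole documents may be preferable when leakage between splits matters.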

Provider: maas
Created: 2025-08-16