
un_pc

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-23
Download link:
https://modelscope.cn/datasets/Helsinki-NLP/un_pc
Description:
# Dataset Card for United Nations Parallel Corpus

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://opus.nlpl.eu/UNPC/corpus/version/UNPC
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** https://aclanthology.org/L16-1561/
- **Leaderboard:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

The United Nations Parallel Corpus is the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal license.

### Supported Tasks and Leaderboards

The underlying task is machine translation.

### Languages

The six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

https://conferences.unite.un.org/UNCORPUS/#disclaimer

The following disclaimer, an integral part of the United Nations Parallel Corpus, shall be respected with regard to the Corpus (no other restrictions apply):

- The United Nations Parallel Corpus is made available without warranty of any kind, explicit or implied. The United Nations specifically makes no warranties or representations as to the accuracy or completeness of the information contained in the United Nations Corpus.
- Under no circumstances shall the United Nations be liable for any loss, liability, injury or damage incurred or suffered that is claimed to have resulted from the use of the United Nations Corpus. The use of the United Nations Corpus is at the user's sole risk. The user specifically acknowledges and agrees that the United Nations is not liable for the conduct of any user. If the user is dissatisfied with any of the material provided in the United Nations Corpus, the user's sole and exclusive remedy is to discontinue using the United Nations Corpus.
- When using the United Nations Corpus, the user must acknowledge the United Nations as the source of the information. For references, please cite this reference: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B., (2016), The United Nations Parallel Corpus, Language Resources and Evaluation (LREC’16), Portorož, Slovenia, May 2016.
- Nothing herein shall constitute or be considered to be a limitation upon or waiver, express or implied, of the privileges and immunities of the United Nations, which are specifically reserved.

### Citation Information

```
@inproceedings{ziemski-etal-2016-united,
    title = "The {U}nited {N}ations Parallel Corpus v1.0",
    author = "Ziemski, Micha{\l} and Junczys-Dowmunt, Marcin and Pouliquen, Bruno",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://www.aclweb.org/anthology/L16-1561",
    pages = "3530--3534",
    abstract = "This paper describes the creation process and statistics of the official United Nations Parallel Corpus, the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russian, and Spanish. The corpus is freely available for download under a liberal license. Apart from the pairwise aligned documents, a fully aligned subcorpus for the six official UN languages is distributed. We provide baseline BLEU scores of our Moses-based SMT systems trained with the full data of language pairs involving English and for all possible translation directions of the six-way subcorpus.",
}
```

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
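
### Example: enumerating language pairs

The card names six languages and the cited abstract refers to "all possible translation directions" of the six-way subcorpus. The following is a minimal sketch of what that implies combinatorially: six languages give 15 unordered pairs and 30 translation directions. The ISO 639-1 codes and the `ar-en` style pair labels are assumptions chosen for illustration; the card itself does not document how pairs are named or stored.

```python
from itertools import combinations, permutations

# ISO 639-1 codes for the six official UN languages named in the card
# (the use of these codes as labels is an assumption for illustration).
UN_LANGUAGES = ["ar", "en", "es", "fr", "ru", "zh"]


def language_pairs(langs):
    """Return unordered language pairs, e.g. ("ar", "en")."""
    return list(combinations(sorted(langs), 2))


def translation_directions(langs):
    """Return ordered (source, target) pairs, so both ("ar", "en") and ("en", "ar")."""
    return list(permutations(sorted(langs), 2))


pairs = language_pairs(UN_LANGUAGES)
directions = translation_directions(UN_LANGUAGES)

print(len(pairs))       # 15
print(len(directions))  # 30
# Hypothetical hyphenated labels for the first three pairs.
print(["-".join(p) for p in pairs[:3]])  # ['ar-en', 'ar-es', 'ar-fr']
```

A machine translation system that treats each direction separately would need up to 30 models for full coverage, which is why the count of directions, not pairs, is usually the relevant number when planning experiments.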

Provider:
maas
Created:
2025-08-16