multiun
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-16.
Download link: https://modelscope.cn/datasets/Helsinki-NLP/multiun
# Dataset Card for OPUS MultiUN
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** https://aclanthology.org/L10-1473/
- **Leaderboard:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Dataset Summary
The MultiUN parallel corpus was extracted from the United Nations website, then cleaned and converted to XML at the Language Technology Lab of DFKI GmbH (LT-DFKI), Germany. The documents were published by the UN between 2000 and 2009.
It is a collection of translated United Nations documents originally compiled by Andreas Eisele and Yu Chen (see http://www.euromatrixplus.net/multi-un/).
The corpus is available in all six official languages of the UN, with around 300 million words per language.
### Supported Tasks and Leaderboards
The underlying task is machine translation.
### Languages
Parallel texts are present in all six official languages: Arabic (`ar`), Chinese (`zh`), English (`en`), French (`fr`),
Russian (`ru`) and Spanish (`es`), with a small part of the documents also available in German (`de`).
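The seven language codes listed above can be combined into unordered language pairs. The following is a minimal sketch of that enumeration. The `xx-yy` naming (codes sorted alphabetically and joined by a hyphen) and the helper name `language_pairs` are assumptions for illustration; the card does not state which pairs are actually distributed or how they are named.

```python
from itertools import combinations
from typing import Iterable, List

# Language codes as listed in this card (six official UN languages plus German).
LANGUAGES = ["ar", "zh", "en", "fr", "ru", "es", "de"]


def language_pairs(codes: Iterable[str]) -> List[str]:
    """Return every unordered pair of codes as an 'xx-yy' string.

    Assumption: pairs are named with the two codes in alphabetical
    order, joined by a hyphen. This is a hypothetical convention.
    """
    return [f"{a}-{b}" for a, b in combinations(sorted(codes), 2)]


pair_names = language_pairs(LANGUAGES)
print(len(pair_names), pair_names[:3])
```

Seven codes yield 7 × 6 / 2 = 21 candidate pairs; pairs that involve German should be expected to be much smaller, since only a small part of the documents is available in that language.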
## Dataset Structure
### Data Instances
```
{
  "translation": {
    "ar": "قرار اتخذته الجمعية العامة",
    "de": "Resolution der Generalversammlung"
  }
}
```
### Data Fields
- `translation` (`dict`): Parallel sentences for the pair of languages.
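As a rough illustration of the record layout, the sketch below unpacks one record into a (source, target) tuple. It assumes each record is a plain dictionary shaped like the instance shown under Data Instances; the helper name `unpack_pair` is hypothetical and not part of any library.

```python
from typing import Dict, Tuple

# One record: {"translation": {<lang code>: <sentence>, <lang code>: <sentence>}}
Record = Dict[str, Dict[str, str]]


def unpack_pair(record: Record, src: str, tgt: str) -> Tuple[str, str]:
    """Return (source sentence, target sentence) for the given language codes.

    Raises KeyError if the record lacks the "translation" field or
    one of the requested language codes.
    """
    translation = record["translation"]
    return translation[src], translation[tgt]


# Example record copied from the Data Instances section of this card.
example: Record = {
    "translation": {
        "ar": "قرار اتخذته الجمعية العامة",
        "de": "Resolution der Generalversammlung",
    }
}

src_text, tgt_text = unpack_pair(example, "ar", "de")
print(tgt_text)
```

Keeping the language codes as explicit arguments means the same helper works for any pair, and in either translation direction.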
### Data Splits
The dataset contains a single "train" split for each language pair.
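Because only a "train" split is provided, users who need validation data have to hold some out themselves. The following is one possible sketch using only the Python standard library. The function name `hold_out`, the 10% fraction, the seed, and the toy records are all illustrative assumptions, not something the card prescribes.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def hold_out(
    items: Sequence[T], fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Deterministically split items into (train, validation) lists.

    Indices are shuffled with a seeded RNG so the split is reproducible;
    the original order is preserved within each returned list.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_valid = max(1, int(len(items) * fraction))
    valid_idx = set(indices[:n_valid])
    train = [x for i, x in enumerate(items) if i not in valid_idx]
    valid = [x for i, x in enumerate(items) if i in valid_idx]
    return train, valid


# Toy stand-in for a "train" split (hypothetical sentences, not corpus data).
toy_train = [
    {"translation": {"en": f"sentence {i}", "fr": f"phrase {i}"}}
    for i in range(20)
]
train_part, valid_part = hold_out(toy_train, fraction=0.1, seed=0)
print(len(train_part), len(valid_part))
```

A random sentence-level split like this can place sentences from the same UN document on both sides; if that kind of leakage matters for an evaluation, splitting by document would be safer, provided document boundaries are available.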
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
Original MultiUN source data: http://www.euromatrixplus.net/multi-un/
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
If you use this corpus in your work, please cite the paper:
```
@inproceedings{eisele-chen-2010-multiun,
    title = "{M}ulti{UN}: A Multilingual Corpus from United Nation Documents",
    author = "Eisele, Andreas and
      Chen, Yu",
    booktitle = "Proceedings of the Seventh International Conference on Language Resources and Evaluation ({LREC}'10)",
    month = may,
    year = "2010",
    address = "Valletta, Malta",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2010/pdf/686_Paper.pdf",
    abstract = "This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.",
}
```
If you use any part of the corpus (hosted in OPUS) in your own work, please cite the following article:
```
@inproceedings{tiedemann-2012-parallel,
    title = "Parallel Data, Tools and Interfaces in {OPUS}",
    author = {Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta and
      Choukri, Khalid and
      Declerck, Thierry and
      Do{\u{g}}an, Mehmet U{\u{g}}ur and
      Maegaard, Bente and
      Mariani, Joseph and
      Moreno, Asuncion and
      Odijk, Jan and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
    pages = "2214--2218",
    abstract = "This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.",
}
```
### Contributions
Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
Provider: maas
Created: 2025-08-16