Jorathan/multiun

Source: Hugging Face. Updated 2026-01-27; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Jorathan/multiun
Resource description:
---
annotations_creators:
- found
language_creators:
- found
language:
- ar
- de
- en
- es
- fr
- ru
- zh
license:
- unknown
multilinguality:
- multilingual
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- translation
task_ids: []
paperswithcode_id: multiun
pretty_name: MultiUN (Multilingual Corpus from United Nation Documents)
config_names:
- ar-de
- ar-en
- ar-es
- ar-fr
- ar-ru
- ar-zh
- de-en
- de-es
- de-fr
- de-ru
- de-zh
- en-es
- en-fr
- en-ru
- en-zh
- es-fr
- es-ru
- es-zh
- fr-ru
- fr-zh
- ru-zh
dataset_info:
- config_name: ar-de
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ar
        - de
  splits:
  - name: train
    num_bytes: 94466261
    num_examples: 165090
  download_size: 41124373
  dataset_size: 94466261
- config_name: ar-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ar
        - en
  splits:
  - name: train
    num_bytes: 4189844561
    num_examples: 9759125
  download_size: 1926776740
  dataset_size: 4189844561
- config_name: ar-es
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ar
        - es
  splits:
  - name: train
    num_bytes: 4509667188
    num_examples: 10119379
  download_size: 2069474168
  dataset_size: 4509667188
- config_name: ar-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ar
        - fr
  splits:
  - name: train
    num_bytes: 4516842065
    num_examples: 9929567
  download_size: 2083442998
  dataset_size: 4516842065
- config_name: ar-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ar
        - ru
  splits:
  - name: train
    num_bytes: 5932858699
    num_examples: 10206243
  download_size: 2544128334
  dataset_size: 5932858699
- config_name: ar-zh
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ar
        - zh
  splits:
  - name: train
    num_bytes: 3781650541
    num_examples: 9832293
  download_size: 1829880809
  dataset_size: 3781650541
- config_name: de-en
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - en
  splits:
  - name: train
    num_bytes: 76684413
    num_examples: 162981
  download_size: 35105094
  dataset_size: 76684413
- config_name: de-es
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - es
  splits:
  - name: train
    num_bytes: 80936517
    num_examples: 162078
  download_size: 37042740
  dataset_size: 80936517
- config_name: de-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - fr
  splits:
  - name: train
    num_bytes: 81888299
    num_examples: 164025
  download_size: 37827000
  dataset_size: 81888299
- config_name: de-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - ru
  splits:
  - name: train
    num_bytes: 111517798
    num_examples: 164792
  download_size: 46723695
  dataset_size: 111517798
- config_name: de-zh
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - de
        - zh
  splits:
  - name: train
    num_bytes: 70534674
    num_examples: 176933
  download_size: 34964647
  dataset_size: 70534674
- config_name: en-es
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - es
  splits:
  - name: train
    num_bytes: 4128132575
    num_examples: 11350967
  download_size: 2030826335
  dataset_size: 4128132575
- config_name: en-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - fr
  splits:
  - name: train
    num_bytes: 4678044616
    num_examples: 13172019
  download_size: 2312275443
  dataset_size: 4678044616
- config_name: en-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - ru
  splits:
  - name: train
    num_bytes: 5632653511
    num_examples: 11654416
  download_size: 2523567444
  dataset_size: 5632653511
- config_name: en-zh
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - en
        - zh
  splits:
  - name: train
    num_bytes: 2960368390
    num_examples: 9564315
  download_size: 1557547095
  dataset_size: 2960368390
- config_name: es-fr
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - fr
  splits:
  - name: train
    num_bytes: 4454703338
    num_examples: 11441889
  download_size: 2187539838
  dataset_size: 4454703338
- config_name: es-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - ru
  splits:
  - name: train
    num_bytes: 5442647242
    num_examples: 10605056
  download_size: 2432480744
  dataset_size: 5442647242
- config_name: es-zh
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - es
        - zh
  splits:
  - name: train
    num_bytes: 3223863318
    num_examples: 9847770
  download_size: 1676774308
  dataset_size: 3223863318
- config_name: fr-ru
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - ru
  splits:
  - name: train
    num_bytes: 5979869673
    num_examples: 11761738
  download_size: 2690520032
  dataset_size: 5979869673
- config_name: fr-zh
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - fr
        - zh
  splits:
  - name: train
    num_bytes: 3241090573
    num_examples: 9690914
  download_size: 1693120344
  dataset_size: 3241090573
- config_name: ru-zh
  features:
  - name: translation
    dtype:
      translation:
        languages:
        - ru
        - zh
  splits:
  - name: train
    num_bytes: 4233867889
    num_examples: 9557007
  download_size: 1984600328
  dataset_size: 4233867889
configs:
- config_name: ar-de
  data_files:
  - split: train
    path: ar-de/train-*
- config_name: ar-en
  data_files:
  - split: train
    path: ar-en/train-*
- config_name: ar-es
  data_files:
  - split: train
    path: ar-es/train-*
- config_name: ar-fr
  data_files:
  - split: train
    path: ar-fr/train-*
- config_name: ar-ru
  data_files:
  - split: train
    path: ar-ru/train-*
- config_name: ar-zh
  data_files:
  - split: train
    path: ar-zh/train-*
- config_name: de-en
  data_files:
  - split: train
    path: de-en/train-*
- config_name: de-es
  data_files:
  - split: train
    path: de-es/train-*
- config_name: de-fr
  data_files:
  - split: train
    path: de-fr/train-*
- config_name: de-ru
  data_files:
  - split: train
    path: de-ru/train-*
- config_name: de-zh
  data_files:
  - split: train
    path: de-zh/train-*
- config_name: en-es
  data_files:
  - split: train
    path: en-es/train-*
- config_name: en-fr
  data_files:
  - split: train
    path: en-fr/train-*
- config_name: en-ru
  data_files:
  - split: train
    path: en-ru/train-*
- config_name: en-zh
  data_files:
  - split: train
    path: en-zh/train-*
- config_name: es-fr
  data_files:
  - split: train
    path: es-fr/train-*
- config_name: es-ru
  data_files:
  - split: train
    path: es-ru/train-*
- config_name: es-zh
  data_files:
  - split: train
    path: es-zh/train-*
- config_name: fr-ru
  data_files:
  - split: train
    path: fr-ru/train-*
- config_name: fr-zh
  data_files:
  - split: train
    path: fr-zh/train-*
- config_name: ru-zh
  data_files:
  - split: train
    path: ru-zh/train-*
---

# Dataset Card for OPUS MultiUN

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://opus.nlpl.eu/MultiUN/corpus/version/MultiUN
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** https://aclanthology.org/L10-1473/
- **Leaderboard:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

The MultiUN parallel corpus was extracted from the United Nations website, then cleaned and converted to XML at the Language Technology Lab of DFKI GmbH (LT-DFKI), Germany. The documents were published by the UN from 2000 to 2009.

This is a collection of translated documents from the United Nations, originally compiled by Andreas Eisele and Yu Chen (see http://www.euromatrixplus.net/multi-un/).

The corpus is available in all 6 official languages of the UN and consists of around 300 million words per language.

### Supported Tasks and Leaderboards

The underlying task is machine translation.

### Languages

Parallel texts are present in all six official languages: Arabic (`ar`), Chinese (`zh`), English (`en`), French (`fr`), Russian (`ru`) and Spanish (`es`). A small part of the documents is also available in German (`de`).

## Dataset Structure

### Data Instances

```
{
    "translation": {
        "ar": "قرار اتخذته الجمعية العامة",
        "de": "Resolution der Generalversammlung"
    }
}
```

### Data Fields

- `translation` (`dict`): Parallel sentences for the pair of languages.

### Data Splits

The dataset contains a single "train" split for each language pair.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

Original MultiUN source data: http://www.euromatrixplus.net/multi-unp

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

If you use this corpus in your work, please cite the paper:

```
@inproceedings{eisele-chen-2010-multiun,
    title = "{M}ulti{UN}: A Multilingual Corpus from United Nation Documents",
    author = "Eisele, Andreas and Chen, Yu",
    booktitle = "Proceedings of the Seventh International Conference on Language Resources and Evaluation ({LREC}'10)",
    month = may,
    year = "2010",
    address = "Valletta, Malta",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2010/pdf/686_Paper.pdf",
    abstract = "This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.",
}
```

If you use any part of the corpus (hosted in OPUS) in your own work, please cite the following article:

```
@inproceedings{tiedemann-2012-parallel,
    title = "Parallel Data, Tools and Interfaces in {OPUS}",
    author = {Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta and Choukri, Khalid and Declerck, Thierry and Do{\u{g}}an, Mehmet U{\u{g}}ur and Maegaard, Bente and Mariani, Joseph and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
    pages = "2214--2218",
    abstract = "This paper presents the current status of OPUS, a growing language resource of parallel corpora and related tools. The focus in OPUS is to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies. In this paper, we report about new data sets and their features, additional annotation tools and models provided from the website and essential interfaces and on-line services included in the project.",
}
```

### Contributions

Thanks to [@patil-suraj](https://github.com/patil-suraj) for adding this dataset.
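
As a usage note for the configurations and the `translation` field described in this card, here is a minimal Python sketch of how the 21 configuration names follow from the seven language codes and how one record can be unpacked. The helper names (`config_names`, `config_for`, `split_pair`, `load_pair`) are hypothetical and not part of the dataset. The repository id `Jorathan/multiun` is assumed from the download link on this page, and `load_pair` assumes the Hugging Face `datasets` library is installed and that the repository is reachable.

```python
from itertools import combinations

# Language codes listed in the card's metadata.
LANGUAGES = ["ar", "de", "en", "es", "fr", "ru", "zh"]


def config_names(languages=LANGUAGES):
    """Return every unordered language pair as 'xx-yy', codes in alphabetical order."""
    return [f"{a}-{b}" for a, b in combinations(sorted(languages), 2)]


def config_for(lang_a, lang_b):
    """Map two language codes to a configuration name, in whichever order they are given."""
    if lang_a == lang_b:
        raise ValueError("a language pair needs two different codes")
    name = "-".join(sorted((lang_a, lang_b)))
    if name not in config_names():
        raise ValueError(f"no such configuration: {name}")
    return name


def split_pair(example, source, target):
    """Pull the (source, target) sentences out of one record's `translation` dict."""
    translation = example["translation"]
    return translation[source], translation[target]


def load_pair(lang_a, lang_b, repo_id="Jorathan/multiun"):
    """Load the single train split for one language pair.

    Assumption: repo_id is taken from this page's download link; this call
    needs the `datasets` package and network access, so it is not run here.
    """
    from datasets import load_dataset  # imported lazily so the helpers above work offline

    return load_dataset(repo_id, config_for(lang_a, lang_b), split="train")
```

With these helpers, `config_for("zh", "en")` yields `"en-zh"`, matching the naming used in the metadata, and `split_pair(record, "de", "ar")` returns the German sentence followed by the Arabic one for a record from the `ar-de` configuration. Several configurations have download sizes of about 2 GB, so passing `streaming=True` to `load_dataset` may be preferable to a full download.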
Provider:
Jorathan