opus-100
Hosted on the ModelScope community (updated 2025-12-05, indexed 2025-08-23)
Download link:
https://modelscope.cn/datasets/Helsinki-NLP/opus-100
# Dataset Card for OPUS-100
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://opus.nlpl.eu/OPUS-100
- **Repository:** https://github.com/EdinburghNLP/opus-100-corpus
- **Paper:** https://arxiv.org/abs/2004.11867
- **Paper:** https://aclanthology.org/L10-1473/
- **Leaderboard:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Dataset Summary
OPUS-100 is an English-centric multilingual corpus covering 100 languages (including English). English-centric means that every training pair includes English on either the source or the target side.
The languages were selected based on the volume of parallel data available in OPUS.
### Supported Tasks and Leaderboards
Translation.
### Languages
OPUS-100 contains approximately 55M sentence pairs. Of the 99 language pairs, 44 have 1M sentence pairs of training data, 73 have at least 100k, and 95 have at least 10k.
## Dataset Structure
### Data Instances
```
{
  "translation": {
    "ca": "El departament de bombers té el seu propi equip d'investigació.",
    "en": "Well, the fire department has its own investigative unit."
  }
}
```
### Data Fields
- `translation` (`dict`): Parallel sentences for the pair of languages.
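
As a minimal sketch of how the `translation` field might be consumed, the helper below pulls a (source, target) sentence pair out of one record shaped like the instance in Data Instances. The name `to_pair` and its parameters are illustrative assumptions, not part of any library.

```python
from typing import Dict, Tuple

# One record: {"translation": {<language code>: <sentence>, ...}}
Record = Dict[str, Dict[str, str]]


def to_pair(record: Record, src: str, tgt: str) -> Tuple[str, str]:
    """Return (source sentence, target sentence) from one record.

    Hypothetical helper: assumes record["translation"] maps language
    codes to sentences, as in the Data Instances example.
    """
    translation = record["translation"]
    return translation[src], translation[tgt]


# The Catalan-English instance shown in this card.
example = {
    "translation": {
        "ca": "El departament de bombers té el seu propi equip d'investigació.",
        "en": "Well, the fire department has its own investigative unit.",
    }
}

source, target = to_pair(example, src="ca", tgt="en")
```

Swapping `src` and `tgt` gives the English-to-Catalan direction from the same record.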
### Data Splits
The dataset is split into training, development, and test portions. The data was prepared by randomly sampling up to 1M sentence pairs per language pair for training and up to 2000 each for development and test. To ensure that there was no overlap (at the monolingual sentence level) between the training and development/test data, the authors applied a filter during sampling to exclude sentences that had already been sampled. This was done cross-lingually so that, for instance, an English sentence in the Portuguese-English portion of the training data could not occur in the Hindi-English test set.
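
The sampling filter described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' actual script: the function name `sample_splits`, the fill order (test, then dev, then train), and the single shared `seen` set are all hypothetical choices. The shared set is stricter than the card requires, since it lets each sentence be used at most once across all splits and language pairs.

```python
import random
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (English sentence, other-language sentence)


def sample_splits(
    corpora: Dict[str, List[Pair]],
    max_train: int = 1_000_000,
    max_eval: int = 2_000,
    seed: int = 0,
) -> Dict[str, Dict[str, List[Pair]]]:
    """Sample test/dev/train per language pair with a cross-lingual filter."""
    rng = random.Random(seed)
    seen: Set[str] = set()  # sentences already sampled, shared by all pairs
    limits = {"test": max_eval, "dev": max_eval, "train": max_train}
    splits: Dict[str, Dict[str, List[Pair]]] = {}

    for lang_pair, pairs in corpora.items():
        shuffled = pairs[:]
        rng.shuffle(shuffled)  # random sampling order
        out: Dict[str, List[Pair]] = {"test": [], "dev": [], "train": []}
        for en, other in shuffled:
            # Monolingual filter: skip if either side was already sampled.
            if en in seen or other in seen:
                continue
            # Assumed fill order: test, then dev, then train.
            for name in ("test", "dev", "train"):
                if len(out[name]) < limits[name]:
                    out[name].append((en, other))
                    seen.update((en, other))
                    break
        splits[lang_pair] = out
    return splits


# Toy data for illustration only; these sentences are not from the corpus.
corpora = {
    "en-pt": [("Hello.", "Olá."), ("Thank you.", "Obrigado."),
              ("Good night.", "Boa noite.")],
    "en-hi": [("Hello.", "नमस्ते।"), ("Thank you.", "धन्यवाद।"),
              ("See you.", "फिर मिलेंगे।")],
}
splits = sample_splits(corpora, max_train=2, max_eval=1)
```

In this toy run, "Hello." and "Thank you." are consumed by the Portuguese-English portion, so the Hindi-English portion can only draw on "See you.".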
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
If you use this corpus, please cite the paper:
```bibtex
@inproceedings{zhang-etal-2020-improving,
    title = "Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation",
    author = "Zhang, Biao  and
      Williams, Philip  and
      Titov, Ivan  and
      Sennrich, Rico",
    editor = "Jurafsky, Dan  and
      Chai, Joyce  and
      Schluter, Natalie  and
      Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.148",
    doi = "10.18653/v1/2020.acl-main.148",
    pages = "1628--1639",
}
```
Please also acknowledge OPUS:
```bibtex
@inproceedings{tiedemann-2012-parallel,
    title = "Parallel Data, Tools and Interfaces in {OPUS}",
    author = {Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta  and
      Choukri, Khalid  and
      Declerck, Thierry  and
      Do{\u{g}}an, Mehmet U{\u{g}}ur  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
    pages = "2214--2218",
}
```
### Contributions
Thanks to [@vasudevgupta7](https://github.com/vasudevgupta7) for adding this dataset.
Provider: maas
Created: 2025-08-16