opus_wikipedia
Source: ModelScope community. Updated 2025-12-05; indexed 2025-08-23.
Download link:
https://modelscope.cn/datasets/Helsinki-NLP/opus_wikipedia
# Dataset Card for OpusWikipedia
## Table of Contents
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** http://opus.nlpl.eu/Wikipedia.php
- **Repository:** None
- **Paper:** http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf
- **Leaderboard:** [More Information Needed]
- **Point of Contact:** [More Information Needed]
### Dataset Summary
This is a corpus of parallel sentences extracted from Wikipedia by Krzysztof Wołk and Krzysztof Marasek.
The dataset contains 20 languages and 36 bitexts.
To load a language pair that isn't part of the predefined configs, specify the two language codes as a pair, e.g.:
```python
from datasets import load_dataset
dataset = load_dataset("opus_wikipedia", lang1="it", lang2="pl")
```
You can find the valid pairs on the homepage listed under Dataset Description: http://opus.nlpl.eu/Wikipedia.php
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
The languages in the dataset are:
- ar
- bg
- cs
- de
- el
- en
- es
- fa
- fr
- he
- hu
- it
- nl
- pl
- pt
- ro
- ru
- sl
- tr
- vi
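As a rough illustration of why the homepage should be consulted for valid pairs, the sketch below checks language codes against the list above and counts how many unordered pairs 20 languages could form. The helper name `check_codes` is hypothetical and not part of any library. Passing the check only means both codes appear in the list, not that a bitext exists for the pair.

```python
from itertools import combinations

# Language codes copied from the list above.
LANGUAGES = [
    "ar", "bg", "cs", "de", "el", "en", "es", "fa", "fr", "he",
    "hu", "it", "nl", "pl", "pt", "ro", "ru", "sl", "tr", "vi",
]


def check_codes(lang1: str, lang2: str) -> tuple[str, str]:
    """Reject codes the card does not list; this does NOT prove the pair exists."""
    for code in (lang1, lang2):
        if code not in LANGUAGES:
            raise ValueError(f"unknown language code: {code!r}")
    if lang1 == lang2:
        raise ValueError("lang1 and lang2 must differ")
    return lang1, lang2


# 20 languages allow 190 unordered pairs, but the card reports only 36 bitexts,
# so most combinations are not available.
possible_pairs = len(list(combinations(LANGUAGES, 2)))
print(possible_pairs)            # 190
print(check_codes("it", "pl"))   # ('it', 'pl')
```

Since only 36 of the 190 possible combinations are reported as bitexts, a code check like this is a first filter at most.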
## Dataset Structure
### Data Instances
```
{
'id': '0',
'translation': {
"ar": "* Encyclopaedia of Mathematics online encyclopaedia from Springer, Graduate-level reference work with over 8,000 entries, illuminating nearly 50,000 notions in mathematics.",
"en": "*Encyclopaedia of Mathematics online encyclopaedia from Springer, Graduate-level reference work with over 8,000 entries, illuminating nearly 50,000 notions in mathematics."
}
}
```
### Data Fields
- `id` (`str`): Unique identifier of the parallel sentence for the pair of languages.
- `translation` (`dict`): Parallel sentences for the pair of languages.
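A minimal sketch of reading these two fields from one record follows. The record is a hypothetical example with the same shape as the instance above, and `to_pair` is an illustrative helper, not a library function.

```python
def to_pair(example: dict, src: str, tgt: str) -> tuple[str, str]:
    """Return (source, target) text from one record's `translation` dict."""
    translation = example["translation"]
    return translation[src], translation[tgt]


# Hypothetical record with the same shape as the instance above.
record = {
    "id": "0",
    "translation": {"it": "Buongiorno.", "pl": "Dzień dobry."},
}

source, target = to_pair(record, "it", "pl")
# Print the id and both sides as one tab-separated line.
print(record["id"], source, target, sep="\t")
```

Note that `id` is a string, so it needs an explicit `int(...)` conversion if numeric ordering matters.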
### Data Splits
The dataset contains a single `train` split.
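Because only a `train` split is provided, any validation or test set has to be carved out by the user. One possible approach, sketched here in plain Python with hypothetical records and an illustrative helper name (`split_train`), is a seeded shuffle followed by a slice. This is an assumption about a reasonable workflow, not something the dataset authors prescribe.

```python
import random


def split_train(examples: list, held_out_fraction: float = 0.1, seed: int = 0):
    """Shuffle deterministically and split into (train, held_out) lists."""
    if not 0.0 < held_out_fraction < 1.0:
        raise ValueError("held_out_fraction must be between 0 and 1")
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Hypothetical records standing in for the `train` split.
records = [
    {"id": str(i), "translation": {"it": f"it {i}", "pl": f"pl {i}"}}
    for i in range(20)
]
train, held_out = split_train(records, held_out_fraction=0.2, seed=42)
print(len(train), len(held_out))  # 16 4
```

If the data is loaded with the `datasets` library, its `Dataset.train_test_split` method offers comparable behavior without a custom helper.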
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
[More Information Needed]
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
[More Information Needed]
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
```bibtex
@article{WOLK2014126,
title = {Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs},
journal = {Procedia Technology},
volume = {18},
pages = {126-132},
year = {2014},
note = {International workshop on Innovations in Information and Communication Science and Technology, IICST 2014, 3-5 September 2014, Warsaw, Poland},
issn = {2212-0173},
doi = {https://doi.org/10.1016/j.protcy.2014.11.024},
url = {https://www.sciencedirect.com/science/article/pii/S2212017314005453},
author = {Krzysztof Wołk and Krzysztof Marasek},
keywords = {Comparable corpora, machine translation, NLP},
}
```
```bibtex
@InProceedings{TIEDEMANN12.463,
author = {J{\"o}rg Tiedemann},
title = {Parallel Data, Tools and Interfaces in OPUS},
booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
year = {2012},
month = {may},
date = {23-25},
address = {Istanbul, Turkey},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-7-7},
language = {english}
}
```
### Contributions
Thanks to [@rkc007](https://github.com/rkc007) for adding this dataset.
Provider: maas
Created: 2025-08-16



