opus_wikipedia

Name: opus_wikipedia
Creator: maas
Published: 2025-12-05 16:46:40
License: No description available

ModelScope community; updated 2025-12-05, indexed 2025-08-23

Download link:

https://modelscope.cn/datasets/Helsinki-NLP/opus_wikipedia

Resource description:

# Dataset Card for OpusWikipedia

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://opus.nlpl.eu/Wikipedia.php
- **Repository:** None
- **Paper:** http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf
- **Leaderboard:** [More Information Needed]
- **Point of Contact:** [More Information Needed]

### Dataset Summary

This is a corpus of parallel sentences extracted from Wikipedia by Krzysztof Wołk and Krzysztof Marasek. The dataset covers 20 languages and contains 36 bitexts.

To load a language pair that is not part of the predefined configs, specify the two language codes as a pair, e.g.

```python
from datasets import load_dataset
dataset = load_dataset("opus_wikipedia", lang1="it", lang2="pl")
```

The valid pairs are listed on the homepage given in the Dataset Description: http://opus.nlpl.eu/Wikipedia.php

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

The languages in the dataset are:

- ar
- bg
- cs
- de
- el
- en
- es
- fa
- fr
- he
- hu
- it
- nl
- pl
- pt
- ro
- ru
- sl
- tr
- vi

## Dataset Structure

### Data Instances

```
{
    'id': '0',
    'translation': {
        "ar": "* Encyclopaedia of Mathematics online encyclopaedia from Springer, Graduate-level reference work with over 8,000 entries, illuminating nearly 50,000 notions in mathematics.",
        "en": "*Encyclopaedia of Mathematics online encyclopaedia from Springer, Graduate-level reference work with over 8,000 entries, illuminating nearly 50,000 notions in mathematics."
    }
}
```

### Data Fields

- `id` (`str`): Unique identifier of the parallel sentence for the pair of languages.
- `translation` (`dict`): Parallel sentences for the pair of languages, keyed by language code.

### Data Splits

The dataset contains a single `train` split.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

[More Information Needed]

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

```bibtex
@article{WOLK2014126,
  title = {Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs},
  journal = {Procedia Technology},
  volume = {18},
  pages = {126-132},
  year = {2014},
  note = {International workshop on Innovations in Information and Communication Science and Technology, IICST 2014, 3-5 September 2014, Warsaw, Poland},
  issn = {2212-0173},
  doi = {https://doi.org/10.1016/j.protcy.2014.11.024},
  url = {https://www.sciencedirect.com/science/article/pii/S2212017314005453},
  author = {Krzysztof Wołk and Krzysztof Marasek},
  keywords = {Comparable corpora, machine translation, NLP},
}
```

```bibtex
@InProceedings{TIEDEMANN12.463,
  author = {J{\"o}rg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in OPUS},
  booktitle = {Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet Ugur Dogan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
  language = {english}
}
```

### Contributions

Thanks to [@rkc007](https://github.com/rkc007) for adding this dataset.
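
The record layout described under Data Fields (an `id` string plus a `translation` dict keyed by language code) can be consumed with a few lines of plain Python. The following is a minimal sketch, not part of the original card: the names `LANGUAGES`, `check_pair`, and `extract_pairs` are hypothetical helpers, and the "drop identical pairs" heuristic is an assumption motivated by the example instance, whose `ar` side is English text that differs from the `en` side only by one space.

```python
from typing import Iterable, List, Tuple

# Language codes copied from the card's Languages section.
LANGUAGES = {
    "ar", "bg", "cs", "de", "el", "en", "es", "fa", "fr", "he",
    "hu", "it", "nl", "pl", "pt", "ro", "ru", "sl", "tr", "vi",
}


def check_pair(lang1: str, lang2: str) -> None:
    """Raise ValueError unless both codes are listed and distinct.

    This only checks the codes against the 20 listed languages; it does
    not know which of the 36 bitexts actually exist (see the homepage).
    """
    for code in (lang1, lang2):
        if code not in LANGUAGES:
            raise ValueError(f"unknown language code: {code!r}")
    if lang1 == lang2:
        raise ValueError("lang1 and lang2 must differ")


def _squash(text: str) -> str:
    # Remove all whitespace and ignore case, so that strings differing
    # only in spacing or capitalisation compare equal.
    return "".join(text.split()).casefold()


def extract_pairs(
    records: Iterable[dict],
    lang1: str,
    lang2: str,
    drop_identical: bool = True,
) -> List[Tuple[str, str, str]]:
    """Return (id, lang1 sentence, lang2 sentence) tuples.

    Pairs with an empty side are always skipped. With drop_identical
    (a hypothetical heuristic, not from the card), pairs whose two sides
    match after whitespace removal are treated as untranslated noise.
    """
    check_pair(lang1, lang2)
    pairs = []
    for record in records:
        translation = record["translation"]
        src = translation[lang1].strip()
        tgt = translation[lang2].strip()
        if not src or not tgt:
            continue
        if drop_identical and _squash(src) == _squash(tgt):
            continue
        pairs.append((record["id"], src, tgt))
    return pairs
```

Assuming the loading call shown in the Dataset Summary works in your environment, the `train` split of the returned object could be passed as `records`, for example `extract_pairs(dataset["train"], "it", "pl")`. Whether the identical-pair filter is appropriate depends on the use case, since some short sentences (names, titles) are legitimately the same in both languages.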


Provider: maas
Created: 2025-08-16
