
NeuML/wikipedia

Hugging Face, updated 2024-01-11, indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/NeuML/wikipedia
Resource overview:
--- annotations_creators: - no-annotation language_creators: - crowdsourced pretty_name: Wikipedia paperswithcode_id: null license: - cc-by-sa-3.0 - gfdl task_categories: - text-generation - fill-mask task_ids: - language-modeling - masked-language-modeling source_datasets: - original multilinguality: - multilingual size_categories: - n<1K - 1K<n<10K - 10K<n<100K - 100K<n<1M - 1M<n<10M language: - aa - ab - ace - af - ak - als - am - an - ang - ar - arc - arz - as - ast - atj - av - ay - az - azb - ba - bar - bcl - be - bg - bh - bi - bjn - bm - bn - bo - bpy - br - bs - bug - bxr - ca - cbk - cdo - ce - ceb - ch - cho - chr - chy - ckb - co - cr - crh - cs - csb - cu - cv - cy - da - de - din - diq - dsb - dty - dv - dz - ee - el - eml - en - eo - es - et - eu - ext - fa - ff - fi - fj - fo - fr - frp - frr - fur - fy - ga - gag - gan - gd - gl - glk - gn - gom - gor - got - gu - gv - ha - hak - haw - he - hi - hif - ho - hr - hsb - ht - hu - hy - ia - id - ie - ig - ii - ik - ilo - inh - io - is - it - iu - ja - jam - jbo - jv - ka - kaa - kab - kbd - kbp - kg - ki - kj - kk - kl - km - kn - ko - koi - krc - ks - ksh - ku - kv - kw - ky - la - lad - lb - lbe - lez - lfn - lg - li - lij - lmo - ln - lo - lrc - lt - ltg - lv - lzh - mai - mdf - mg - mh - mhr - mi - min - mk - ml - mn - mr - mrj - ms - mt - mus - mwl - my - myv - mzn - na - nah - nan - nap - nds - ne - new - ng - nl - nn - 'no' - nov - nrf - nso - nv - ny - oc - olo - om - or - os - pa - pag - pam - pap - pcd - pdc - pfl - pi - pih - pl - pms - pnb - pnt - ps - pt - qu - rm - rmy - rn - ro - ru - rue - rup - rw - sa - sah - sat - sc - scn - sco - sd - se - sg - sgs - sh - si - sk - sl - sm - sn - so - sq - sr - srn - ss - st - stq - su - sv - sw - szl - ta - tcy - tdt - te - tg - th - ti - tk - tl - tn - to - tpi - tr - ts - tt - tum - tw - ty - tyv - udm - ug - uk - ur - uz - ve - vec - vep - vi - vls - vo - vro - wa - war - wo - wuu - xal - xh - xmf - yi - yo - yue - za - zea - zh - zu 
language_bcp47: - nds-nl config_names: - 20240101.aa - 20220101.ab - 20240101.ace - 20240101.ady - 20240101.af - 20240101.ak - 20240101.als - 20240101.am - 20240101.an - 20240101.ang - 20240101.ar - 20240101.arc - 20240101.arz - 20240101.as - 20240101.ast - 20240101.atj - 20240101.av - 20240101.ay - 20240101.az - 20240101.azb - 20240101.ba - 20240101.bar - 20240101.bat-smg - 20240101.bcl - 20240101.be - 20240101.be-x-old - 20240101.bg - 20240101.bh - 20240101.bi - 20240101.bjn - 20240101.bm - 20240101.bn - 20240101.bo - 20240101.bpy - 20240101.br - 20240101.bs - 20240101.bug - 20240101.bxr - 20240101.ca - 20240101.cbk-zam - 20240101.cdo - 20240101.ce - 20240101.ceb - 20240101.ch - 20240101.cho - 20240101.chr - 20240101.chy - 20240101.ckb - 20240101.co - 20240101.cr - 20240101.crh - 20240101.cs - 20240101.csb - 20240101.cu - 20240101.cv - 20240101.cy - 20240101.da - 20240101.de - 20240101.din - 20240101.diq - 20240101.dsb - 20240101.dty - 20240101.dv - 20240101.dz - 20240101.ee - 20240101.el - 20240101.eml - 20240101.en - 20240101.eo - 20240101.es - 20240101.et - 20240101.eu - 20240101.ext - 20240101.fa - 20240101.ff - 20240101.fi - 20240101.fiu-vro - 20240101.fj - 20240101.fo - 20240101.fr - 20240101.frp - 20240101.frr - 20240101.fur - 20240101.fy - 20240101.ga - 20240101.gag - 20240101.gan - 20240101.gd - 20240101.gl - 20240101.glk - 20240101.gn - 20240101.gom - 20240101.gor - 20240101.got - 20240101.gu - 20240101.gv - 20240101.ha - 20240101.hak - 20240101.haw - 20240101.he - 20240101.hi - 20240101.hif - 20240101.ho - 20240101.hr - 20240101.hsb - 20240101.ht - 20240101.hu - 20240101.hy - 20240101.ia - 20240101.id - 20240101.ie - 20240101.ig - 20240101.ii - 20240101.ik - 20240101.ilo - 20240101.inh - 20240101.io - 20240101.is - 20240101.it - 20240101.iu - 20240101.ja - 20240101.jam - 20240101.jbo - 20240101.jv - 20240101.ka - 20240101.kaa - 20240101.kab - 20240101.kbd - 20240101.kbp - 20240101.kg - 20240101.ki - 20240101.kj - 20240101.kk - 20240101.kl - 20240101.km 
- 20240101.kn - 20240101.ko - 20240101.koi - 20240101.krc - 20240101.ks - 20240101.ksh - 20240101.ku - 20240101.kv - 20240101.kw - 20240101.ky - 20240101.la - 20240101.lad - 20240101.lb - 20240101.lbe - 20240101.lez - 20240101.lfn - 20240101.lg - 20240101.li - 20240101.lij - 20240101.lmo - 20240101.ln - 20240101.lo - 20240101.lrc - 20240101.lt - 20240101.ltg - 20240101.lv - 20240101.mai - 20240101.map-bms - 20240101.mdf - 20240101.mg - 20240101.mh - 20240101.mhr - 20240101.mi - 20240101.min - 20240101.mk - 20240101.ml - 20240101.mn - 20240101.mr - 20240101.mrj - 20240101.ms - 20240101.mt - 20240101.mus - 20240101.mwl - 20240101.my - 20240101.myv - 20240101.mzn - 20240101.na - 20240101.nah - 20240101.nap - 20240101.nds - 20240101.nds-nl - 20240101.ne - 20240101.new - 20240101.ng - 20240101.nl - 20240101.nn - 20240101.no - 20240101.nov - 20240101.nrm - 20240101.nso - 20240101.nv - 20240101.ny - 20240101.oc - 20240101.olo - 20240101.om - 20240101.or - 20240101.os - 20240101.pa - 20240101.pag - 20240101.pam - 20240101.pap - 20240101.pcd - 20240101.pdc - 20240101.pfl - 20240101.pi - 20240101.pih - 20240101.pl - 20240101.pms - 20240101.pnb - 20240101.pnt - 20240101.ps - 20240101.pt - 20240101.qu - 20240101.rm - 20240101.rmy - 20240101.rn - 20240101.ro - 20240101.roa-rup - 20240101.roa-tara - 20240101.ru - 20240101.rue - 20240101.rw - 20240101.sa - 20240101.sah - 20240101.sat - 20240101.sc - 20240101.scn - 20240101.sco - 20240101.sd - 20240101.se - 20240101.sg - 20240101.sh - 20240101.si - 20240101.simple - 20240101.sk - 20240101.sl - 20240101.sm - 20240101.sn - 20240101.so - 20240101.sq - 20240101.sr - 20240101.srn - 20240101.ss - 20240101.st - 20240101.stq - 20240101.su - 20240101.sv - 20240101.sw - 20240101.szl - 20240101.ta - 20240101.tcy - 20240101.te - 20240101.tet - 20240101.tg - 20240101.th - 20240101.ti - 20240101.tk - 20240101.tl - 20240101.tn - 20240101.to - 20240101.tpi - 20240101.tr - 20240101.ts - 20240101.tt - 20240101.tum - 20240101.tw - 20240101.ty - 
20240101.tyv - 20240101.udm - 20240101.ug - 20240101.uk - 20240101.ur - 20240101.uz - 20240101.ve - 20240101.vec - 20240101.vep - 20240101.vi - 20240101.vls - 20240101.vo - 20240101.wa - 20240101.war - 20240101.wo - 20240101.wuu - 20240101.xal - 20240101.xh - 20240101.xmf - 20240101.yi - 20240101.yo - 20240101.za - 20240101.zea - 20240101.zh - 20240101.zh-classical - 20240101.zh-min-nan - 20240101.zh-yue - 20240101.zu ---

# Dataset Card for Wikipedia

This repo is a fork of the [olm/wikipedia](https://huggingface.co/datasets/olm/wikipedia) repo which itself is a fork of the original Hugging Face Wikipedia repo [here](https://huggingface.co/datasets/wikipedia). This fork modifies `olm/wikipedia` to enable running on lower resourced machines. These changes have been proposed as a [PR with the olm/wikipedia project](https://huggingface.co/datasets/olm/wikipedia/discussions/6).

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://dumps.wikimedia.org](https://dumps.wikimedia.org)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Dataset Summary

Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

The articles are parsed using the ``mwparserfromhell`` tool.

To load this dataset you need to install the following dependencies:

```
pip install mwparserfromhell datasets
```

Then, you can load any subset of Wikipedia per language and per date this way:

```python
from datasets import load_dataset

load_dataset("neuml/wikipedia", language="en", date="20240101")
```

You can find the full list of languages and dates [here](https://dumps.wikimedia.org/backup-index.html).

### Supported Tasks and Leaderboards

The dataset is generally used for Language Modeling.

### Languages

You can find the list of languages [here](https://meta.wikimedia.org/wiki/List_of_Wikipedias).

## Dataset Structure

### Data Instances

An example looks as follows:

```
{'id': '1',
 'url': 'https://simple.wikipedia.org/wiki/April',
 'title': 'April',
 'text': 'April is the fourth month...'
}
```

### Data Fields

The data fields are the same among all configurations:

- `id` (`str`): ID of the article.
- `url` (`str`): URL of the article.
- `title` (`str`): Title of the article.
- `text` (`str`): Text content of the article.
### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

Most of Wikipedia's text and many of its images are co-licensed under the [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License) (CC BY-SA) and the [GNU Free Documentation License](https://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License) (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

Some text has been imported only under CC BY-SA and CC BY-SA-compatible license and cannot be reused under GFDL; such text will be identified on the page footer, in the page history, or on the discussion page of the article that utilizes the text.

### Citation Information

```
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}
```
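Provided by:
NeuML
Summary of original information

Dataset Card: Wikipedia

Dataset Description

Dataset Summary

The Wikipedia dataset contains cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/), with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).

The articles are parsed using the mwparserfromhell tool.

To load this dataset, you need to install the following dependencies:

pip install mwparserfromhell datasets

Then you can load any subset of Wikipedia by language and date:

from datasets import load_dataset

load_dataset("neuml/wikipedia", language="en", date="20240101")

The full list of languages and dates is available at https://dumps.wikimedia.org/backup-index.html.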
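The loading call above takes a language code and a dump date, and the `config_names` entries in the card metadata appear to follow a `<date>.<language>` pattern (for example `20240101.en`). The sketch below shows one way to build and parse such names. The helper names (`config_name`, `parse_config_name`, `load_subset`, `WikipediaConfig`) are hypothetical and not part of the dataset or the `datasets` library, and the naming pattern is inferred from the metadata rather than documented.

```python
from typing import NamedTuple


class WikipediaConfig(NamedTuple):
    """A (date, language) pair identifying one Wikipedia subset."""
    date: str      # dump date, e.g. "20240101"
    language: str  # Wikipedia language code, e.g. "en" or "zh-min-nan"


def config_name(language: str, date: str) -> str:
    """Build a name in the "<date>.<language>" form seen in the card metadata."""
    if len(date) != 8 or not date.isdigit():
        raise ValueError(f"date must look like YYYYMMDD, got {date!r}")
    if not language:
        raise ValueError("language must be a non-empty code")
    return f"{date}.{language}"


def parse_config_name(name: str) -> WikipediaConfig:
    """Split a name such as "20240101.zh-min-nan" into date and language."""
    date, sep, language = name.partition(".")
    if not sep:
        raise ValueError(f"expected '<date>.<language>', got {name!r}")
    config_name(language, date)  # reuse the validation above
    return WikipediaConfig(date=date, language=language)


def load_subset(language: str, date: str):
    """Wrap the keyword-style call shown in the card (assumption: needs network access)."""
    from datasets import load_dataset  # imported lazily so the helpers work offline
    return load_dataset("neuml/wikipedia", language=language, date=date)


print(config_name("en", "20240101"))
print(parse_config_name("20240101.zh-min-nan"))
```

The loader is kept separate from the pure string helpers so that name handling can be checked without downloading anything.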

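Supported Tasks and Leaderboards

The dataset is generally used for language modeling.

Languages

The list of languages is available at https://meta.wikimedia.org/wiki/List_of_Wikipedias.

Dataset Structure

Data Instances

An example looks as follows:

{ "id": "1", "url": "https://simple.wikipedia.org/wiki/April", "title": "April", "text": "April is the fourth month..." }

Data Fields

The data fields are the same across all configurations:

  • id (str): ID of the article.
  • url (str): URL of the article.
  • title (str): Title of the article.
  • text (str): Text content of the article.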
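
As a rough illustration of the four fields listed above, the following sketch checks that a record has the expected string fields and reads the edition label from its URL. The function names `validate_record` and `wiki_edition` are hypothetical helpers written for this example, and treating the first hostname label as the edition is an assumption based on the sample URL.

```python
from typing import Any, List, Mapping
from urllib.parse import urlparse

# Field names as listed in the card; all four are documented as strings.
EXPECTED_FIELDS = ("id", "url", "title", "text")


def validate_record(record: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    for field in EXPECTED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            kind = type(record[field]).__name__
            problems.append(f"field {field} is {kind}, expected str")
    return problems


def wiki_edition(url: str) -> str:
    """Return the first hostname label of an article URL (assumed to be the edition)."""
    host = urlparse(url).hostname or ""
    return host.split(".")[0]


# The sample record shown in the card.
example = {
    "id": "1",
    "url": "https://simple.wikipedia.org/wiki/April",
    "title": "April",
    "text": "April is the fourth month...",
}

print(validate_record(example))
print(wiki_edition(example["url"]))
```

Returning a list of problems instead of raising on the first one makes it easier to log every issue when scanning many records.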

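Data Splits

More Information Needed

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

Most of Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL) (unversioned, with no invariant sections, front-cover texts, or back-cover texts).

Some text has been imported only under CC BY-SA and CC BY-SA-compatible licenses and cannot be reused under GFDL; such text will be identified on the page footer, in the page history, or on the discussion page of the article that uses the text.

Citation Information

@ONLINE{wikidump, author = "Wikimedia Foundation", title = "Wikimedia Downloads", url = "https://dumps.wikimedia.org" }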

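Collected summary
Dataset introduction
Construction method
In natural language processing, building large-scale text datasets is essential for training language models. The NeuML/wikipedia dataset is built through a systematic pipeline whose core data comes from the full article dumps that the Wikimedia Foundation publishes periodically. The dataset uses the mwparserfromhell tool to parse raw Wikipedia markup, automatically stripping non-body content such as references and navigation templates and keeping the cleaned plain-text articles. Each language edition is organized as an independent subset and versioned by a specific timestamp, which keeps the data source transparent and traceable. This approach preserves the completeness of Wikipedia's knowledge structure while adapting the data to machine learning tasks.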
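To make the cleaning step described above more concrete, here is a deliberately simplified sketch of stripping references, templates, and link markup from wikitext. It is not the dataset's implementation and does not use mwparserfromhell: it swaps in a few standard-library regular expressions so the example runs without extra dependencies. The function name `strip_wikitext_sketch` is hypothetical, and real wikitext has many cases (tables, file links, HTML comments) that this sketch ignores.

```python
import re

# Self-closing references such as <ref name="a"/>.
REF_SELF = re.compile(r"<ref[^>]*/>", re.IGNORECASE)
# Paired references such as <ref name="a">...</ref>.
REF_PAIR = re.compile(r"<ref[^>/]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# A template with no nested braces inside; applied repeatedly to peel nesting.
INNER_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
# [[target]] or [[target|label]]; the captured group is the visible text.
WIKI_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")
# Bold and italic quote runs such as '' and '''.
QUOTE_RUN = re.compile(r"'{2,}")


def strip_wikitext_sketch(text: str) -> str:
    """Remove a small subset of wiki markup (simplified stand-in, see lead-in)."""
    text = REF_SELF.sub("", text)
    text = REF_PAIR.sub("", text)
    # Remove innermost templates until nothing changes, which handles nesting.
    previous = None
    while previous != text:
        previous = text
        text = INNER_TEMPLATE.sub("", text)
    text = WIKI_LINK.sub(r"\1", text)
    text = QUOTE_RUN.sub("", text)
    # Collapse runs of spaces and tabs left behind by the removals.
    return re.sub(r"[ \t]+", " ", text).strip()


raw = (
    "'''April''' is the [[month|fourth month]] of the [[year]]."
    "<ref name=\"a\">Some citation</ref><ref name=\"b\"/> "
    "{{Infobox month|note={{small|x}}}}"
)
print(strip_wikitext_sketch(raw))
```

With mwparserfromhell installed, the library's own `mwparserfromhell.parse(text).strip_code()` is the more robust way to drop markup; the regex version here only illustrates the kind of content that gets removed.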
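Features
As a multilingual text resource, the dataset offers broad cross-lingual coverage. It spans hundreds of language editions, from widely used languages such as English and Chinese to many low-resource languages, providing valuable material for research on linguistic diversity. The dataset uses a standardized field structure: every record contains an article identifier, the original URL, the title, and the cleaned body text, which makes cross-lingual comparison much easier. Notably, this version is optimized for environments with limited computing resources: technical changes lower the hardware requirements for loading and processing the data, so a wider research community can use it.
Usage
In practice, researchers can access the resource through the standard interface of the Hugging Face datasets library. Install the mwparserfromhell and datasets packages first, then load a specific subset by passing a language code and a date, for example the English Wikipedia data for January 2024. This modular design supports fetching data on demand and avoids the storage cost of a full download. The dataset mainly serves language modeling and masked language modeling tasks; its cleaned text can be used directly for model pretraining or fine-tuning, while the structured metadata fields provide a basis for derived research such as knowledge retrieval and cross-lingual alignment.
Background and challenges
Background overview
The Wikipedia dataset is a foundational resource in natural language processing, built in response to the need for large-scale multilingual text corpora. The underlying content is continuously created and maintained by the Wikipedia community under the Wikimedia Foundation; since Wikipedia launched in 2001, it has become the world's largest open knowledge base. The dataset's core purpose is to provide high-quality, structured text data for language model pretraining, cross-lingual understanding, and knowledge-intensive tasks. By integrating articles in hundreds of languages, it has greatly advanced directions such as multimodal learning, machine translation, and semantic analysis, and has laid a data foundation for more inclusive and global AI.
Current challenges
The Wikipedia dataset faces several challenges in practice. First, data quality and consistency: because articles are crowdsourced, they contain redundant information, differences in wording, and even factual errors, which affects the reliability of model training. Second, uneven distribution of language resources: major languages such as English and Chinese are data-rich, while many low-resource languages have few articles, which limits the balanced development of multilingual models. Third, technical challenges in construction, including parsing complex wiki markup, cleaning unstructured text, and synchronizing and deduplicating data across dump versions, all of which demand a robust processing pipeline.
Common scenarios
Classic use cases
In natural language processing, the classic use of the Wikipedia dataset, as a large-scale multilingual text resource, is language model pretraining. The dataset covers hundreds of languages and provides encyclopedia article text that has been cleaned of markup and unrelated sections, giving models a rich corpus for learning the grammar, semantics, and knowledge of human language. Researchers often use it to build cross-lingual representation models or to carry out in-depth semantic analysis of a specific language, improving machine understanding and generation of natural language.
Practical applications
In practice, the Wikipedia dataset is widely integrated into search engines, intelligent assistants, and content recommendation systems to strengthen their knowledge retrieval and content generation. Companies use it to train models that improve the accuracy and fluency of services such as automatic summarization and machine translation. In education technology, it is used to build intelligent tutoring systems that give students accurate explanations. In digital humanities research, it supports large-scale text analysis that helps scholars study language change and patterns of cultural transmission.
Derived and related work
Building on Wikipedia text, the research community has produced a series of influential works. For example, BERT was pretrained on English Wikipedia together with BooksCorpus, and the GPT-3 training mix included English Wikipedia, which helped lay the foundations of modern natural language processing. Cross-lingual models such as multilingual BERT and XLM were trained on multilingual Wikipedia text and showed strong zero-shot transfer. Knowledge graph projects such as DBpedia extract structured information from Wikipedia and have advanced the Semantic Web. Together, these works have deepened research on language representation and knowledge integration.
The content above was collected and summarized by 遇见数据集.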