livinNector/wikipedia
Dataset Card for Wikipedia
Dataset Description
Dataset Summary
Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip wiki markup and unwanted sections (references, etc.).
Supported Tasks and Leaderboards
The dataset is generally used for language modeling.
Languages
The dataset includes articles in the following languages:
- aa, ab, ace, af, ak, als, am, an, ang, ar, arc, arz, as, ast, atj, av, ay, az, azb, ba, bar, bcl, be, bg, bh, bi, bjn, bm, bn, bo, bpy, br, bs, bug, bxr, ca, cbk, cdo, ce, ceb, ch, cho, chr, chy, ckb, co, cr, crh, cs, csb, cu, cv, cy, da, de, din, diq, dsb, dty, dv, dz, ee, el, eml, en, eo, es, et, eu, ext, fa, ff, fi, fj, fo, fr, frp, frr, fur, fy, ga, gag, gan, gd, gl, glk, gn, gom, gor, got, gu, gv, ha, hak, haw, he, hi, hif, ho, hr, hsb, ht, hu, hy, ia, id, ie, ig, ii, ik, ilo, inh, io, is, it, iu, ja, jam, jbo, jv, ka, kaa, kab, kbd, kbp, kg, ki, kj, kk, kl, km, kn, ko, koi, krc, ks, ksh, ku, kv, kw, ky, la, lad, lb, lbe, lez, lfn, lg, li, lij, lmo, ln, lo, lrc, lt, ltg, lv, lzh, mai, mdf, mg, mh, mhr, mi, min, mk, ml, mn, mr, mrj, ms, mt, mus, mwl, my, myv, mzn, na, nah, nan, nap, nds, ne, new, ng, nl, nn, no, nov, nrf, nso, nv, ny, oc, olo, om, or, os, pa, pag, pam, pap, pcd, pdc, pfl, pi, pih, pl, pms, pnb, pnt, ps, pt, qu, rm, rmy, rn, ro, ru, rue, rup, rw, sa, sah, sat, sc, scn, sco, sd, se, sg, sgs, sh, si, sk, sl, sm, sn, so, sq, sr, srn, ss, st, stq, su, sv, sw, szl, ta, tcy, tdt, te, tg, th, ti, tk, tl, tn, to, tpi, tr, ts, tt, tum, tw, ty, tyv, udm, ug, uk, ur, uz, ve, vec, vep, vi, vls, vo, vro, wa, war, wo, wuu, xal, xh, xmf, yi, yo, yue, za, zea, zh, zu
Dataset Structure
Data Instances
An example looks as follows:
{
  "id": "1",
  "url": "https://simple.wikipedia.org/wiki/April",
  "title": "April",
  "text": "April is the fourth month..."
}
Data Fields
The data fields are the same among all configurations:
- id (str): ID of the article.
- url (str): URL of the article.
- title (str): Title of the article.
- text (str): Text content of the article.
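The field list can be checked mechanically. The following is a minimal sketch, not part of the dataset's tooling: `ARTICLE_FIELDS` and `validate_article` are hypothetical names introduced here, and the sample record is the one shown under Data Instances.

```python
from typing import Any, Mapping

# Field names and types as listed in this card.
ARTICLE_FIELDS = {"id": str, "url": str, "title": str, "text": str}


def validate_article(record: Mapping[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in ARTICLE_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"field {name} is not {expected_type.__name__}")
    return problems


# The example record from this card.
example = {
    "id": "1",
    "url": "https://simple.wikipedia.org/wiki/April",
    "title": "April",
    "text": "April is the fourth month...",
}
print(validate_article(example))
```

Note that `id` is a string, not an integer, so a record with `"id": 1` would be reported as a problem.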
Data Splits
The number of examples in the train split for several configurations:
| name | train |
|---|---|
| 20220301.de | 2665357 |
| 20220301.en | 6458670 |
| 20220301.fr | 2402095 |
| 20220301.frr | 15199 |
| 20220301.it | 1743035 |
| 20220301.simple | 205328 |
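The configuration names in the table appear to follow a `<dump date>.<language code>` pattern. The sketch below works under that assumption: `parse_config_name`, `sizes_by_language`, and `load_train_split` are hypothetical helpers, and whether the `livinNector/wikipedia` repository exposes exactly these configuration names is not confirmed by this card. `load_dataset` is the standard entry point of the Hugging Face `datasets` library; it is imported lazily so that the other helpers run without it.

```python
def parse_config_name(name: str) -> tuple[str, str]:
    """Split a name such as '20220301.frr' into (dump_date, language)."""
    dump_date, sep, language = name.partition(".")
    if not sep or len(dump_date) != 8 or not dump_date.isdigit() or not language:
        raise ValueError(f"unexpected config name: {name!r}")
    return dump_date, language


# Train split sizes copied from the table above.
TRAIN_SIZES = {
    "20220301.de": 2665357,
    "20220301.en": 6458670,
    "20220301.fr": 2402095,
    "20220301.frr": 15199,
    "20220301.it": 1743035,
    "20220301.simple": 205328,
}


def sizes_by_language(sizes: dict[str, int]) -> dict[str, int]:
    """Re-key the split sizes by language code only."""
    return {parse_config_name(name)[1]: count for name, count in sizes.items()}


def load_train_split(config_name: str, repo_id: str = "livinNector/wikipedia"):
    """Load one configuration (assumed names); needs `datasets` and network access."""
    from datasets import load_dataset  # lazy import, only needed when called

    return load_dataset(repo_id, config_name, split="train")


print(sizes_by_language(TRAIN_SIZES))
```

Calling `load_train_split("20220301.simple")` would be the smallest download among the listed configurations, which makes it a reasonable first test if the assumed names hold.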
Dataset Creation
Curation Rationale
The dataset is curated to provide a clean and structured source of text data from Wikipedia articles for various natural language processing tasks, particularly language modeling.
Source Data
Initial Data Collection and Normalization
The data is collected from the Wikipedia dump (https://dumps.wikimedia.org/) and processed to remove wiki markup and unwanted sections such as references.
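To make the cleaning step concrete, here is a deliberately simplified sketch. It is not the pipeline used to build this dataset: it relies on a few regular expressions over wikitext (the format of the dumps), whereas a real pipeline would likely use a proper wikitext parser. The function names are hypothetical, and apart from "references", the section titles in `UNWANTED_SECTIONS` are assumptions.

```python
import re

# Only "references" is named in this card; the other titles are assumptions.
UNWANTED_SECTIONS = {"references", "external links", "see also"}

# Matches wikitext headings such as "== History ==" or "=== Notes ===".
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def drop_unwanted_sections(wikitext: str) -> str:
    """Remove unwanted sections, including their nested subsections."""
    kept = []
    skip_level = None  # heading level of the section being skipped, if any
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).lower()
            # A heading at the same or a higher level ends the skipped section.
            if skip_level is not None and level <= skip_level:
                skip_level = None
            if skip_level is None and title in UNWANTED_SECTIONS:
                skip_level = level
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept)


def strip_inline_markup(text: str) -> str:
    """Unwrap [[target|label]] links and remove bold/italic quote markers."""
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    return re.sub(r"'{2,}", "", text)


def clean_article(wikitext: str) -> str:
    return strip_inline_markup(drop_unwanted_sections(wikitext))


sample = (
    "'''April''' is the fourth [[month]] of the [[year]].\n"
    "== History ==\n"
    "Some text.\n"
    "== References ==\n"
    "* A citation\n"
    "=== Notes ===\n"
    "A note\n"
)
print(clean_article(sample))
```

The sketch leaves templates, tables, and heading markers untouched, so its output is rougher than the `text` field in this dataset.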
Who are the source language producers?
The source language producers are the contributors to Wikipedia, including volunteers and experts from various fields.
Annotations
Annotation process
The dataset does not include annotations as it is primarily text data from Wikipedia articles.
Who are the annotators?
There are no annotators for this dataset as it is sourced directly from Wikipedia articles.
Personal and Sensitive Information
The dataset is sourced from publicly available Wikipedia articles and adds no personal data of its own. Articles about real people, such as biographies, do contain names and other public biographical details.
Considerations for Using the Data
Social Impact of Dataset
The dataset can be used for various NLP tasks, contributing to advancements in technology and knowledge. However, users should be aware of potential biases present in the data due to the nature of contributions to Wikipedia.
Discussion of Biases
Wikipedia articles may reflect biases present in their contributors. Users of the dataset should be mindful of this and consider potential biases when using the data for research or development.
Other Known Limitations
The dataset does not include images or other multimedia content from Wikipedia articles.
Additional Information
Dataset Curators
The dataset is curated by the Hugging Face team.
Licensing Information
Most of Wikipedia's text and many of its images are co-licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA) and the GNU Free Documentation License (GFDL).
Citation Information
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}
Contributions
Thanks to @lewtun, @mariamabarham, @thomwolf, @lhoestq, @patrickvonplaten for adding this dataset.



