sssOrganization/wikipedia

Source: Hugging Face · Updated 2026-01-05 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/sssOrganization/wikipedia
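
As a rough sketch of how a single configuration of this dataset might be loaded: the config names in the metadata on this page follow the pattern `<dump date>.<wiki code>` (for example `20231101.ab`), each with one `train` split and the string fields `id`, `url`, `title` and `text`. The helper names `config_name` and `load_wikipedia` are hypothetical, and the example assumes the Hugging Face `datasets` library is installed and the repository is reachable from your network.

```python
DUMP_DATE = "20231101"  # dump-date prefix used by the config names on this page


def config_name(wiki_code: str, dump: str = DUMP_DATE) -> str:
    """Build a config name such as '20231101.ab' (hypothetical helper)."""
    return f"{dump}.{wiki_code}"


def load_wikipedia(wiki_code: str, dump: str = DUMP_DATE, streaming: bool = True):
    """Load the train split of one config (hypothetical helper).

    Streaming is on by default so that large configs are not downloaded
    in full before the first record can be read.
    """
    # Imported lazily so config_name() works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "sssOrganization/wikipedia",
        config_name(wiki_code, dump),
        split="train",
        streaming=streaming,
    )


if __name__ == "__main__":
    ds = load_wikipedia("ab")
    first = next(iter(ds))
    # Each record carries the four string fields listed in dataset_info.
    print(first["id"], first["url"], first["title"])
```

Note that the config names use Wikipedia project codes (for example `zh-yue` or `be-x-old`), which do not always match the tags in the `language:` list (for example `yue`), so the config name should be taken from the `configs:` section rather than derived from a language tag.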
Resource description:
---
language: [ab, ace, ady, af, alt, am, ami, an, ang, anp, ar, arc, ary, arz, as, ast, atj, av, avk, awa, ay, az, azb,
  ba, ban, bar, bbc, bcl, be, bg, bh, bi, bjn, blk, bm, bn, bo, bpy, br, bs, bug, bxr,
  ca, cbk, cdo, ce, ceb, ch, chr, chy, ckb, co, cr, crh, cs, csb, cu, cv, cy,
  da, dag, de, dga, din, diq, dsb, dty, dv, dz, ee, el, eml, en, eo, es, et, eu, ext,
  fa, fat, ff, fi, fj, fo, fon, fr, frp, frr, fur, fy,
  ga, gag, gan, gcr, gd, gl, glk, gn, gom, gor, got, gpe, gsw, gu, guc, gur, guw, gv,
  ha, hak, haw, hbs, he, hi, hif, hr, hsb, ht, hu, hy, hyw,
  ia, id, ie, ig, ik, ilo, inh, io, is, it, iu, ja, jam, jbo, jv,
  ka, kaa, kab, kbd, kbp, kcg, kg, ki, kk, kl, km, kn, ko, koi, krc, ks, ksh, ku, kv, kw, ky,
  la, lad, lb, lbe, lez, lfn, lg, li, lij, lld, lmo, ln, lo, lt, ltg, lv, lzh,
  mad, mai, map, mdf, mg, mhr, mi, min, mk, ml, mn, mni, mnw, mr, mrj, ms, mt, mwl, my, myv, mzn,
  nah, nan, nap, nds, ne, new, nia, nl, nn, 'no', nov, nqo, nrf, nso, nv, ny,
  oc, olo, om, or, os, pa, pag, pam, pap, pcd, pcm, pdc, pfl, pi, pih, pl, pms, pnb, pnt, ps, pt, pwn,
  qu, rm, rmy, rn, ro, ru, rue, rup, rw,
  sa, sah, sat, sc, scn, sco, sd, se, sg, sgs, shi, shn, si, sk, skr, sl, sm, smn, sn, so, sq, sr, srn, ss, st, stq, su, sv, sw, szl, szy,
  ta, tay, tcy, te, tet, tg, th, ti, tk, tl, tly, tn, to, tpi, tr, trv, ts, tt, tum, tw, ty, tyv,
  udm, ug, uk, ur, uz, ve, vec, vep, vi, vls, vo, vro, wa, war, wo, wuu,
  xal, xh, xmf, yi, yo, yue, za, zea, zgh, zh, zu]
license: [cc-by-sa-3.0, gfdl]
size_categories: [n<1K, 1K<n<10K, 10K<n<100K, 100K<n<1M, 1M<n<10M]
task_categories: [text-generation, fill-mask]
task_ids: [language-modeling, masked-language-modeling]
configs: -
config_name: 20231101.ab data_files: - split: train path: 20231101.ab/train-* - config_name: 20231101.ace data_files: - split: train path: 20231101.ace/train-* - config_name: 20231101.ady data_files: - split: train path: 20231101.ady/train-* - config_name: 20231101.af data_files: - split: train path: 20231101.af/train-* - config_name: 20231101.als data_files: - split: train path: 20231101.als/train-* - config_name: 20231101.alt data_files: - split: train path: 20231101.alt/train-* - config_name: 20231101.am data_files: - split: train path: 20231101.am/train-* - config_name: 20231101.ami data_files: - split: train path: 20231101.ami/train-* - config_name: 20231101.an data_files: - split: train path: 20231101.an/train-* - config_name: 20231101.ang data_files: - split: train path: 20231101.ang/train-* - config_name: 20231101.anp data_files: - split: train path: 20231101.anp/train-* - config_name: 20231101.ar data_files: - split: train path: 20231101.ar/train-* - config_name: 20231101.arc data_files: - split: train path: 20231101.arc/train-* - config_name: 20231101.ary data_files: - split: train path: 20231101.ary/train-* - config_name: 20231101.arz data_files: - split: train path: 20231101.arz/train-* - config_name: 20231101.as data_files: - split: train path: 20231101.as/train-* - config_name: 20231101.ast data_files: - split: train path: 20231101.ast/train-* - config_name: 20231101.atj data_files: - split: train path: 20231101.atj/train-* - config_name: 20231101.av data_files: - split: train path: 20231101.av/train-* - config_name: 20231101.avk data_files: - split: train path: 20231101.avk/train-* - config_name: 20231101.awa data_files: - split: train path: 20231101.awa/train-* - config_name: 20231101.ay data_files: - split: train path: 20231101.ay/train-* - config_name: 20231101.az data_files: - split: train path: 20231101.az/train-* - config_name: 20231101.azb data_files: - split: train path: 20231101.azb/train-* - config_name: 20231101.ba data_files: - split: 
train path: 20231101.ba/train-* - config_name: 20231101.ban data_files: - split: train path: 20231101.ban/train-* - config_name: 20231101.bar data_files: - split: train path: 20231101.bar/train-* - config_name: 20231101.bat-smg data_files: - split: train path: 20231101.bat-smg/train-* - config_name: 20231101.bcl data_files: - split: train path: 20231101.bcl/train-* - config_name: 20231101.be data_files: - split: train path: 20231101.be/train-* - config_name: 20231101.be-x-old data_files: - split: train path: 20231101.be-x-old/train-* - config_name: 20231101.bg data_files: - split: train path: 20231101.bg/train-* - config_name: 20231101.bh data_files: - split: train path: 20231101.bh/train-* - config_name: 20231101.bi data_files: - split: train path: 20231101.bi/train-* - config_name: 20231101.bjn data_files: - split: train path: 20231101.bjn/train-* - config_name: 20231101.blk data_files: - split: train path: 20231101.blk/train-* - config_name: 20231101.bm data_files: - split: train path: 20231101.bm/train-* - config_name: 20231101.bn data_files: - split: train path: 20231101.bn/train-* - config_name: 20231101.bo data_files: - split: train path: 20231101.bo/train-* - config_name: 20231101.bpy data_files: - split: train path: 20231101.bpy/train-* - config_name: 20231101.br data_files: - split: train path: 20231101.br/train-* - config_name: 20231101.bs data_files: - split: train path: 20231101.bs/train-* - config_name: 20231101.bug data_files: - split: train path: 20231101.bug/train-* - config_name: 20231101.bxr data_files: - split: train path: 20231101.bxr/train-* - config_name: 20231101.ca data_files: - split: train path: 20231101.ca/train-* - config_name: 20231101.cbk-zam data_files: - split: train path: 20231101.cbk-zam/train-* - config_name: 20231101.cdo data_files: - split: train path: 20231101.cdo/train-* - config_name: 20231101.ce data_files: - split: train path: 20231101.ce/train-* - config_name: 20231101.ceb data_files: - split: train path: 
20231101.ceb/train-* - config_name: 20231101.ch data_files: - split: train path: 20231101.ch/train-* - config_name: 20231101.chr data_files: - split: train path: 20231101.chr/train-* - config_name: 20231101.chy data_files: - split: train path: 20231101.chy/train-* - config_name: 20231101.ckb data_files: - split: train path: 20231101.ckb/train-* - config_name: 20231101.co data_files: - split: train path: 20231101.co/train-* - config_name: 20231101.cr data_files: - split: train path: 20231101.cr/train-* - config_name: 20231101.crh data_files: - split: train path: 20231101.crh/train-* - config_name: 20231101.cs data_files: - split: train path: 20231101.cs/train-* - config_name: 20231101.csb data_files: - split: train path: 20231101.csb/train-* - config_name: 20231101.cu data_files: - split: train path: 20231101.cu/train-* - config_name: 20231101.cv data_files: - split: train path: 20231101.cv/train-* - config_name: 20231101.cy data_files: - split: train path: 20231101.cy/train-* - config_name: 20231101.da data_files: - split: train path: 20231101.da/train-* - config_name: 20231101.dag data_files: - split: train path: 20231101.dag/train-* - config_name: 20231101.de data_files: - split: train path: 20231101.de/train-* - config_name: 20231101.din data_files: - split: train path: 20231101.din/train-* - config_name: 20231101.diq data_files: - split: train path: 20231101.diq/train-* - config_name: 20231101.dsb data_files: - split: train path: 20231101.dsb/train-* - config_name: 20231101.dty data_files: - split: train path: 20231101.dty/train-* - config_name: 20231101.dv data_files: - split: train path: 20231101.dv/train-* - config_name: 20231101.dz data_files: - split: train path: 20231101.dz/train-* - config_name: 20231101.ee data_files: - split: train path: 20231101.ee/train-* - config_name: 20231101.el data_files: - split: train path: 20231101.el/train-* - config_name: 20231101.eml data_files: - split: train path: 20231101.eml/train-* - config_name: 20231101.en 
data_files: - split: train path: 20231101.en/train-* - config_name: 20231101.eo data_files: - split: train path: 20231101.eo/train-* - config_name: 20231101.es data_files: - split: train path: 20231101.es/train-* - config_name: 20231101.et data_files: - split: train path: 20231101.et/train-* - config_name: 20231101.eu data_files: - split: train path: 20231101.eu/train-* - config_name: 20231101.ext data_files: - split: train path: 20231101.ext/train-* - config_name: 20231101.fa data_files: - split: train path: 20231101.fa/train-* - config_name: 20231101.fat data_files: - split: train path: 20231101.fat/train-* - config_name: 20231101.ff data_files: - split: train path: 20231101.ff/train-* - config_name: 20231101.fi data_files: - split: train path: 20231101.fi/train-* - config_name: 20231101.fiu-vro data_files: - split: train path: 20231101.fiu-vro/train-* - config_name: 20231101.fj data_files: - split: train path: 20231101.fj/train-* - config_name: 20231101.fo data_files: - split: train path: 20231101.fo/train-* - config_name: 20231101.fon data_files: - split: train path: 20231101.fon/train-* - config_name: 20231101.fr data_files: - split: train path: 20231101.fr/train-* - config_name: 20231101.frp data_files: - split: train path: 20231101.frp/train-* - config_name: 20231101.frr data_files: - split: train path: 20231101.frr/train-* - config_name: 20231101.fur data_files: - split: train path: 20231101.fur/train-* - config_name: 20231101.fy data_files: - split: train path: 20231101.fy/train-* - config_name: 20231101.ga data_files: - split: train path: 20231101.ga/train-* - config_name: 20231101.gag data_files: - split: train path: 20231101.gag/train-* - config_name: 20231101.gan data_files: - split: train path: 20231101.gan/train-* - config_name: 20231101.gcr data_files: - split: train path: 20231101.gcr/train-* - config_name: 20231101.gd data_files: - split: train path: 20231101.gd/train-* - config_name: 20231101.gl data_files: - split: train path: 
20231101.gl/train-* - config_name: 20231101.glk data_files: - split: train path: 20231101.glk/train-* - config_name: 20231101.gn data_files: - split: train path: 20231101.gn/train-* - config_name: 20231101.gom data_files: - split: train path: 20231101.gom/train-* - config_name: 20231101.gor data_files: - split: train path: 20231101.gor/train-* - config_name: 20231101.got data_files: - split: train path: 20231101.got/train-* - config_name: 20231101.gpe data_files: - split: train path: 20231101.gpe/train-* - config_name: 20231101.gu data_files: - split: train path: 20231101.gu/train-* - config_name: 20231101.guc data_files: - split: train path: 20231101.guc/train-* - config_name: 20231101.gur data_files: - split: train path: 20231101.gur/train-* - config_name: 20231101.guw data_files: - split: train path: 20231101.guw/train-* - config_name: 20231101.gv data_files: - split: train path: 20231101.gv/train-* - config_name: 20231101.ha data_files: - split: train path: 20231101.ha/train-* - config_name: 20231101.hak data_files: - split: train path: 20231101.hak/train-* - config_name: 20231101.haw data_files: - split: train path: 20231101.haw/train-* - config_name: 20231101.he data_files: - split: train path: 20231101.he/train-* - config_name: 20231101.hi data_files: - split: train path: 20231101.hi/train-* - config_name: 20231101.hif data_files: - split: train path: 20231101.hif/train-* - config_name: 20231101.hr data_files: - split: train path: 20231101.hr/train-* - config_name: 20231101.hsb data_files: - split: train path: 20231101.hsb/train-* - config_name: 20231101.ht data_files: - split: train path: 20231101.ht/train-* - config_name: 20231101.hu data_files: - split: train path: 20231101.hu/train-* - config_name: 20231101.hy data_files: - split: train path: 20231101.hy/train-* - config_name: 20231101.hyw data_files: - split: train path: 20231101.hyw/train-* - config_name: 20231101.ia data_files: - split: train path: 20231101.ia/train-* - config_name: 20231101.id 
data_files: - split: train path: 20231101.id/train-* - config_name: 20231101.ie data_files: - split: train path: 20231101.ie/train-* - config_name: 20231101.ig data_files: - split: train path: 20231101.ig/train-* - config_name: 20231101.ik data_files: - split: train path: 20231101.ik/train-* - config_name: 20231101.ilo data_files: - split: train path: 20231101.ilo/train-* - config_name: 20231101.inh data_files: - split: train path: 20231101.inh/train-* - config_name: 20231101.io data_files: - split: train path: 20231101.io/train-* - config_name: 20231101.is data_files: - split: train path: 20231101.is/train-* - config_name: 20231101.it data_files: - split: train path: 20231101.it/train-* - config_name: 20231101.iu data_files: - split: train path: 20231101.iu/train-* - config_name: 20231101.ja data_files: - split: train path: 20231101.ja/train-* - config_name: 20231101.jam data_files: - split: train path: 20231101.jam/train-* - config_name: 20231101.jbo data_files: - split: train path: 20231101.jbo/train-* - config_name: 20231101.jv data_files: - split: train path: 20231101.jv/train-* - config_name: 20231101.ka data_files: - split: train path: 20231101.ka/train-* - config_name: 20231101.kaa data_files: - split: train path: 20231101.kaa/train-* - config_name: 20231101.kab data_files: - split: train path: 20231101.kab/train-* - config_name: 20231101.kbd data_files: - split: train path: 20231101.kbd/train-* - config_name: 20231101.kbp data_files: - split: train path: 20231101.kbp/train-* - config_name: 20231101.kcg data_files: - split: train path: 20231101.kcg/train-* - config_name: 20231101.kg data_files: - split: train path: 20231101.kg/train-* - config_name: 20231101.ki data_files: - split: train path: 20231101.ki/train-* - config_name: 20231101.kk data_files: - split: train path: 20231101.kk/train-* - config_name: 20231101.kl data_files: - split: train path: 20231101.kl/train-* - config_name: 20231101.km data_files: - split: train path: 20231101.km/train-* - 
config_name: 20231101.kn data_files: - split: train path: 20231101.kn/train-* - config_name: 20231101.ko data_files: - split: train path: 20231101.ko/train-* - config_name: 20231101.koi data_files: - split: train path: 20231101.koi/train-* - config_name: 20231101.krc data_files: - split: train path: 20231101.krc/train-* - config_name: 20231101.ks data_files: - split: train path: 20231101.ks/train-* - config_name: 20231101.ksh data_files: - split: train path: 20231101.ksh/train-* - config_name: 20231101.ku data_files: - split: train path: 20231101.ku/train-* - config_name: 20231101.kv data_files: - split: train path: 20231101.kv/train-* - config_name: 20231101.kw data_files: - split: train path: 20231101.kw/train-* - config_name: 20231101.ky data_files: - split: train path: 20231101.ky/train-* - config_name: 20231101.la data_files: - split: train path: 20231101.la/train-* - config_name: 20231101.lad data_files: - split: train path: 20231101.lad/train-* - config_name: 20231101.lb data_files: - split: train path: 20231101.lb/train-* - config_name: 20231101.lbe data_files: - split: train path: 20231101.lbe/train-* - config_name: 20231101.lez data_files: - split: train path: 20231101.lez/train-* - config_name: 20231101.lfn data_files: - split: train path: 20231101.lfn/train-* - config_name: 20231101.lg data_files: - split: train path: 20231101.lg/train-* - config_name: 20231101.li data_files: - split: train path: 20231101.li/train-* - config_name: 20231101.lij data_files: - split: train path: 20231101.lij/train-* - config_name: 20231101.lld data_files: - split: train path: 20231101.lld/train-* - config_name: 20231101.lmo data_files: - split: train path: 20231101.lmo/train-* - config_name: 20231101.ln data_files: - split: train path: 20231101.ln/train-* - config_name: 20231101.lo data_files: - split: train path: 20231101.lo/train-* - config_name: 20231101.lt data_files: - split: train path: 20231101.lt/train-* - config_name: 20231101.ltg data_files: - split: train path: 
20231101.ltg/train-* - config_name: 20231101.lv data_files: - split: train path: 20231101.lv/train-* - config_name: 20231101.mad data_files: - split: train path: 20231101.mad/train-* - config_name: 20231101.mai data_files: - split: train path: 20231101.mai/train-* - config_name: 20231101.map-bms data_files: - split: train path: 20231101.map-bms/train-* - config_name: 20231101.mdf data_files: - split: train path: 20231101.mdf/train-* - config_name: 20231101.mg data_files: - split: train path: 20231101.mg/train-* - config_name: 20231101.mhr data_files: - split: train path: 20231101.mhr/train-* - config_name: 20231101.mi data_files: - split: train path: 20231101.mi/train-* - config_name: 20231101.min data_files: - split: train path: 20231101.min/train-* - config_name: 20231101.mk data_files: - split: train path: 20231101.mk/train-* - config_name: 20231101.ml data_files: - split: train path: 20231101.ml/train-* - config_name: 20231101.mn data_files: - split: train path: 20231101.mn/train-* - config_name: 20231101.mni data_files: - split: train path: 20231101.mni/train-* - config_name: 20231101.mnw data_files: - split: train path: 20231101.mnw/train-* - config_name: 20231101.mr data_files: - split: train path: 20231101.mr/train-* - config_name: 20231101.mrj data_files: - split: train path: 20231101.mrj/train-* - config_name: 20231101.ms data_files: - split: train path: 20231101.ms/train-* - config_name: 20231101.mt data_files: - split: train path: 20231101.mt/train-* - config_name: 20231101.mwl data_files: - split: train path: 20231101.mwl/train-* - config_name: 20231101.my data_files: - split: train path: 20231101.my/train-* - config_name: 20231101.myv data_files: - split: train path: 20231101.myv/train-* - config_name: 20231101.mzn data_files: - split: train path: 20231101.mzn/train-* - config_name: 20231101.nah data_files: - split: train path: 20231101.nah/train-* - config_name: 20231101.nap data_files: - split: train path: 20231101.nap/train-* - config_name: 
20231101.nds data_files: - split: train path: 20231101.nds/train-* - config_name: 20231101.nds-nl data_files: - split: train path: 20231101.nds-nl/train-* - config_name: 20231101.ne data_files: - split: train path: 20231101.ne/train-* - config_name: 20231101.new data_files: - split: train path: 20231101.new/train-* - config_name: 20231101.nia data_files: - split: train path: 20231101.nia/train-* - config_name: 20231101.nl data_files: - split: train path: 20231101.nl/train-* - config_name: 20231101.nn data_files: - split: train path: 20231101.nn/train-* - config_name: 20231101.no data_files: - split: train path: 20231101.no/train-* - config_name: 20231101.nov data_files: - split: train path: 20231101.nov/train-* - config_name: 20231101.nqo data_files: - split: train path: 20231101.nqo/train-* - config_name: 20231101.nrm data_files: - split: train path: 20231101.nrm/train-* - config_name: 20231101.nso data_files: - split: train path: 20231101.nso/train-* - config_name: 20231101.nv data_files: - split: train path: 20231101.nv/train-* - config_name: 20231101.ny data_files: - split: train path: 20231101.ny/train-* - config_name: 20231101.oc data_files: - split: train path: 20231101.oc/train-* - config_name: 20231101.olo data_files: - split: train path: 20231101.olo/train-* - config_name: 20231101.om data_files: - split: train path: 20231101.om/train-* - config_name: 20231101.or data_files: - split: train path: 20231101.or/train-* - config_name: 20231101.os data_files: - split: train path: 20231101.os/train-* - config_name: 20231101.pa data_files: - split: train path: 20231101.pa/train-* - config_name: 20231101.pag data_files: - split: train path: 20231101.pag/train-* - config_name: 20231101.pam data_files: - split: train path: 20231101.pam/train-* - config_name: 20231101.pap data_files: - split: train path: 20231101.pap/train-* - config_name: 20231101.pcd data_files: - split: train path: 20231101.pcd/train-* - config_name: 20231101.pcm data_files: - split: train path: 
20231101.pcm/train-* - config_name: 20231101.pdc data_files: - split: train path: 20231101.pdc/train-* - config_name: 20231101.pfl data_files: - split: train path: 20231101.pfl/train-* - config_name: 20231101.pi data_files: - split: train path: 20231101.pi/train-* - config_name: 20231101.pih data_files: - split: train path: 20231101.pih/train-* - config_name: 20231101.pl data_files: - split: train path: 20231101.pl/train-* - config_name: 20231101.pms data_files: - split: train path: 20231101.pms/train-* - config_name: 20231101.pnb data_files: - split: train path: 20231101.pnb/train-* - config_name: 20231101.pnt data_files: - split: train path: 20231101.pnt/train-* - config_name: 20231101.ps data_files: - split: train path: 20231101.ps/train-* - config_name: 20231101.pt data_files: - split: train path: 20231101.pt/train-* - config_name: 20231101.pwn data_files: - split: train path: 20231101.pwn/train-* - config_name: 20231101.qu data_files: - split: train path: 20231101.qu/train-* - config_name: 20231101.rm data_files: - split: train path: 20231101.rm/train-* - config_name: 20231101.rmy data_files: - split: train path: 20231101.rmy/train-* - config_name: 20231101.rn data_files: - split: train path: 20231101.rn/train-* - config_name: 20231101.ro data_files: - split: train path: 20231101.ro/train-* - config_name: 20231101.roa-rup data_files: - split: train path: 20231101.roa-rup/train-* - config_name: 20231101.roa-tara data_files: - split: train path: 20231101.roa-tara/train-* - config_name: 20231101.ru data_files: - split: train path: 20231101.ru/train-* - config_name: 20231101.rue data_files: - split: train path: 20231101.rue/train-* - config_name: 20231101.rw data_files: - split: train path: 20231101.rw/train-* - config_name: 20231101.sa data_files: - split: train path: 20231101.sa/train-* - config_name: 20231101.sah data_files: - split: train path: 20231101.sah/train-* - config_name: 20231101.sat data_files: - split: train path: 20231101.sat/train-* - config_name: 
20231101.sc data_files: - split: train path: 20231101.sc/train-* - config_name: 20231101.scn data_files: - split: train path: 20231101.scn/train-* - config_name: 20231101.sco data_files: - split: train path: 20231101.sco/train-* - config_name: 20231101.sd data_files: - split: train path: 20231101.sd/train-* - config_name: 20231101.se data_files: - split: train path: 20231101.se/train-* - config_name: 20231101.sg data_files: - split: train path: 20231101.sg/train-* - config_name: 20231101.sh data_files: - split: train path: 20231101.sh/train-* - config_name: 20231101.shi data_files: - split: train path: 20231101.shi/train-* - config_name: 20231101.shn data_files: - split: train path: 20231101.shn/train-* - config_name: 20231101.si data_files: - split: train path: 20231101.si/train-* - config_name: 20231101.simple data_files: - split: train path: 20231101.simple/train-* - config_name: 20231101.sk data_files: - split: train path: 20231101.sk/train-* - config_name: 20231101.skr data_files: - split: train path: 20231101.skr/train-* - config_name: 20231101.sl data_files: - split: train path: 20231101.sl/train-* - config_name: 20231101.sm data_files: - split: train path: 20231101.sm/train-* - config_name: 20231101.smn data_files: - split: train path: 20231101.smn/train-* - config_name: 20231101.sn data_files: - split: train path: 20231101.sn/train-* - config_name: 20231101.so data_files: - split: train path: 20231101.so/train-* - config_name: 20231101.sq data_files: - split: train path: 20231101.sq/train-* - config_name: 20231101.sr data_files: - split: train path: 20231101.sr/train-* - config_name: 20231101.srn data_files: - split: train path: 20231101.srn/train-* - config_name: 20231101.ss data_files: - split: train path: 20231101.ss/train-* - config_name: 20231101.st data_files: - split: train path: 20231101.st/train-* - config_name: 20231101.stq data_files: - split: train path: 20231101.stq/train-* - config_name: 20231101.su data_files: - split: train path: 
20231101.su/train-* - config_name: 20231101.sv data_files: - split: train path: 20231101.sv/train-* - config_name: 20231101.sw data_files: - split: train path: 20231101.sw/train-* - config_name: 20231101.szl data_files: - split: train path: 20231101.szl/train-* - config_name: 20231101.szy data_files: - split: train path: 20231101.szy/train-* - config_name: 20231101.ta data_files: - split: train path: 20231101.ta/train-* - config_name: 20231101.tay data_files: - split: train path: 20231101.tay/train-* - config_name: 20231101.tcy data_files: - split: train path: 20231101.tcy/train-* - config_name: 20231101.te data_files: - split: train path: 20231101.te/train-* - config_name: 20231101.tet data_files: - split: train path: 20231101.tet/train-* - config_name: 20231101.tg data_files: - split: train path: 20231101.tg/train-* - config_name: 20231101.th data_files: - split: train path: 20231101.th/train-* - config_name: 20231101.ti data_files: - split: train path: 20231101.ti/train-* - config_name: 20231101.tk data_files: - split: train path: 20231101.tk/train-* - config_name: 20231101.tl data_files: - split: train path: 20231101.tl/train-* - config_name: 20231101.tly data_files: - split: train path: 20231101.tly/train-* - config_name: 20231101.tn data_files: - split: train path: 20231101.tn/train-* - config_name: 20231101.to data_files: - split: train path: 20231101.to/train-* - config_name: 20231101.tpi data_files: - split: train path: 20231101.tpi/train-* - config_name: 20231101.tr data_files: - split: train path: 20231101.tr/train-* - config_name: 20231101.trv data_files: - split: train path: 20231101.trv/train-* - config_name: 20231101.ts data_files: - split: train path: 20231101.ts/train-* - config_name: 20231101.tt data_files: - split: train path: 20231101.tt/train-* - config_name: 20231101.tum data_files: - split: train path: 20231101.tum/train-* - config_name: 20231101.tw data_files: - split: train path: 20231101.tw/train-* - config_name: 20231101.ty data_files: - 
split: train path: 20231101.ty/train-* - config_name: 20231101.tyv data_files: - split: train path: 20231101.tyv/train-* - config_name: 20231101.udm data_files: - split: train path: 20231101.udm/train-* - config_name: 20231101.ug data_files: - split: train path: 20231101.ug/train-* - config_name: 20231101.uk data_files: - split: train path: 20231101.uk/train-* - config_name: 20231101.ur data_files: - split: train path: 20231101.ur/train-* - config_name: 20231101.uz data_files: - split: train path: 20231101.uz/train-* - config_name: 20231101.ve data_files: - split: train path: 20231101.ve/train-* - config_name: 20231101.vec data_files: - split: train path: 20231101.vec/train-* - config_name: 20231101.vep data_files: - split: train path: 20231101.vep/train-* - config_name: 20231101.vi data_files: - split: train path: 20231101.vi/train-* - config_name: 20231101.vls data_files: - split: train path: 20231101.vls/train-* - config_name: 20231101.vo data_files: - split: train path: 20231101.vo/train-* - config_name: 20231101.wa data_files: - split: train path: 20231101.wa/train-* - config_name: 20231101.war data_files: - split: train path: 20231101.war/train-* - config_name: 20231101.wo data_files: - split: train path: 20231101.wo/train-* - config_name: 20231101.wuu data_files: - split: train path: 20231101.wuu/train-* - config_name: 20231101.xal data_files: - split: train path: 20231101.xal/train-* - config_name: 20231101.xh data_files: - split: train path: 20231101.xh/train-* - config_name: 20231101.xmf data_files: - split: train path: 20231101.xmf/train-* - config_name: 20231101.yi data_files: - split: train path: 20231101.yi/train-* - config_name: 20231101.yo data_files: - split: train path: 20231101.yo/train-* - config_name: 20231101.za data_files: - split: train path: 20231101.za/train-* - config_name: 20231101.zea data_files: - split: train path: 20231101.zea/train-* - config_name: 20231101.zh data_files: - split: train path: 20231101.zh/train-* - config_name: 
20231101.zh-classical data_files: - split: train path: 20231101.zh-classical/train-* - config_name: 20231101.zh-min-nan data_files: - split: train path: 20231101.zh-min-nan/train-* - config_name: 20231101.zh-yue data_files: - split: train path: 20231101.zh-yue/train-* - config_name: 20231101.zu data_files: - split: train path: 20231101.zu/train-* dataset_info: - config_name: 20231101.ab features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4334455 num_examples: 6152 download_size: 1237796 dataset_size: 4334455 - config_name: 20231101.ace features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 5065801 num_examples: 13003 download_size: 1574258 dataset_size: 5065801 - config_name: 20231101.ady features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 765030 num_examples: 706 download_size: 347450 dataset_size: 765030 - config_name: 20231101.af features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 226672176 num_examples: 112518 download_size: 124485544 dataset_size: 226672176 - config_name: 20231101.als features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 81450196 num_examples: 30013 download_size: 49452211 dataset_size: 81450196 - config_name: 20231101.alt features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6819963 num_examples: 1087 download_size: 2910477 dataset_size: 6819963 - config_name: 20231101.am features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: 
string splits: - name: train num_bytes: 24218002 num_examples: 13906 download_size: 10720027 dataset_size: 24218002 - config_name: 20231101.ami features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4460174 num_examples: 1628 download_size: 2261859 dataset_size: 4460174 - config_name: 20231101.an features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 57572050 num_examples: 44249 download_size: 29573020 dataset_size: 57572050 - config_name: 20231101.ang features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2913906 num_examples: 4121 download_size: 1789811 dataset_size: 2913906 - config_name: 20231101.anp features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 9226211 num_examples: 2749 download_size: 3355979 dataset_size: 9226211 - config_name: 20231101.ar features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3124486159 num_examples: 1219201 download_size: 1323304271 dataset_size: 3124486159 - config_name: 20231101.arc features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 849731 num_examples: 1936 download_size: 369584 dataset_size: 849731 - config_name: 20231101.ary features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 12049878 num_examples: 8087 download_size: 4672257 dataset_size: 12049878 - config_name: 20231101.arz features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: 
text dtype: string splits: - name: train num_bytes: 1402294447 num_examples: 1620194 download_size: 317231585 dataset_size: 1402294447 - config_name: 20231101.as features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 90312333 num_examples: 12338 download_size: 34581561 dataset_size: 90312333 - config_name: 20231101.ast features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 470575521 num_examples: 133419 download_size: 271196430 dataset_size: 470575521 - config_name: 20231101.atj features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1012467 num_examples: 1971 download_size: 513962 dataset_size: 1012467 - config_name: 20231101.av features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6084045 num_examples: 3426 download_size: 2573436 dataset_size: 6084045 - config_name: 20231101.avk features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 32119428 num_examples: 28353 download_size: 7984474 dataset_size: 32119428 - config_name: 20231101.awa features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3703396 num_examples: 3679 download_size: 1269824 dataset_size: 3703396 - config_name: 20231101.ay features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4395813 num_examples: 5384 download_size: 1756131 dataset_size: 4395813 - config_name: 20231101.az features: - name: id dtype: string - name: url dtype: string - name: title dtype: 
string - name: text dtype: string splits: - name: train num_bytes: 433663157 num_examples: 196158 download_size: 230064038 dataset_size: 433663157 - config_name: 20231101.azb features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 187041147 num_examples: 243376 download_size: 46739926 dataset_size: 187041147 - config_name: 20231101.ba features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 297738837 num_examples: 63319 download_size: 122595805 dataset_size: 297738837 - config_name: 20231101.ban features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 18012727 num_examples: 20986 download_size: 6715876 dataset_size: 18012727 - config_name: 20231101.bar features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 36317102 num_examples: 27096 download_size: 21799389 dataset_size: 36317102 - config_name: 20231101.bat-smg features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 7212849 num_examples: 17221 download_size: 3348765 dataset_size: 7212849 - config_name: 20231101.bcl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 20394331 num_examples: 15743 download_size: 11369234 dataset_size: 20394331 - config_name: 20231101.be features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 624718980 num_examples: 236165 download_size: 284921288 dataset_size: 624718980 - config_name: 20231101.be-x-old features: - name: id dtype: string - 
name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 252510447 num_examples: 84361 download_size: 114318588 dataset_size: 252510447 - config_name: 20231101.bg features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1103334425 num_examples: 294275 download_size: 512344058 dataset_size: 1103334425 - config_name: 20231101.bh features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 16675295 num_examples: 8612 download_size: 5880458 dataset_size: 16675295 - config_name: 20231101.bi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 404249 num_examples: 1548 download_size: 203610 dataset_size: 404249 - config_name: 20231101.bjn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6884860 num_examples: 10519 download_size: 3323032 dataset_size: 6884860 - config_name: 20231101.blk features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 26566991 num_examples: 2946 download_size: 8028430 dataset_size: 26566991 - config_name: 20231101.bm features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 623659 num_examples: 1258 download_size: 343812 dataset_size: 623659 - config_name: 20231101.bn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 962624238 num_examples: 143069 download_size: 343885999 dataset_size: 962624238 - config_name: 20231101.bo features: - name: id 
dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 132723880 num_examples: 12881 download_size: 38851784 dataset_size: 132723880 - config_name: 20231101.bpy features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 42975314 num_examples: 25165 download_size: 6568483 dataset_size: 42975314 - config_name: 20231101.br features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 85635744 num_examples: 84340 download_size: 49768597 dataset_size: 85635744 - config_name: 20231101.bs features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 193734399 num_examples: 92596 download_size: 107858627 dataset_size: 193734399 - config_name: 20231101.bug features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3434889 num_examples: 15880 download_size: 817034 dataset_size: 3434889 - config_name: 20231101.bxr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6687172 num_examples: 2791 download_size: 3078699 dataset_size: 6687172 - config_name: 20231101.ca features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1958810542 num_examples: 737409 download_size: 1116799343 dataset_size: 1958810542 - config_name: 20231101.cbk-zam features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2061944 num_examples: 3285 download_size: 825899 dataset_size: 2061944 - config_name: 
20231101.cdo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 5109207 num_examples: 16449 download_size: 1982914 dataset_size: 5109207 - config_name: 20231101.ce features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 730387049 num_examples: 601271 download_size: 88393330 dataset_size: 730387049 - config_name: 20231101.ceb features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4568256711 num_examples: 6122708 download_size: 828085216 dataset_size: 4568256711 - config_name: 20231101.ch features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 178002 num_examples: 576 download_size: 89277 dataset_size: 178002 - config_name: 20231101.chr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 767618 num_examples: 1113 download_size: 343140 dataset_size: 767618 - config_name: 20231101.chy features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 148139 num_examples: 802 download_size: 75865 dataset_size: 148139 - config_name: 20231101.ckb features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 107150420 num_examples: 52024 download_size: 42964544 dataset_size: 107150420 - config_name: 20231101.co features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 11104243 num_examples: 7799 download_size: 5794731 dataset_size: 11104243 - 
config_name: 20231101.cr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 57257 num_examples: 187 download_size: 36081 dataset_size: 57257 - config_name: 20231101.crh features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 9689171 num_examples: 27691 download_size: 3654461 dataset_size: 9689171 - config_name: 20231101.cs features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1566286962 num_examples: 534044 download_size: 976484249 dataset_size: 1566286962 - config_name: 20231101.csb features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3748643 num_examples: 5480 download_size: 2055233 dataset_size: 3748643 - config_name: 20231101.cu features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 981592 num_examples: 1235 download_size: 398252 dataset_size: 981592 - config_name: 20231101.cv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 81873026 num_examples: 51863 download_size: 29640641 dataset_size: 81873026 - config_name: 20231101.cy features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 305837783 num_examples: 279455 download_size: 112257456 dataset_size: 305837783 - config_name: 20231101.da features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 547068330 num_examples: 295347 download_size: 327688122 dataset_size: 
547068330 - config_name: 20231101.dag features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 21618973 num_examples: 10071 download_size: 9026986 dataset_size: 21618973 - config_name: 20231101.de features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 9622925305 num_examples: 2845308 download_size: 5771317942 dataset_size: 9622925305 - config_name: 20231101.din features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 564398 num_examples: 512 download_size: 340530 dataset_size: 564398 - config_name: 20231101.diq features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 19671441 num_examples: 41775 download_size: 7616839 dataset_size: 19671441 - config_name: 20231101.dsb features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3315228 num_examples: 3379 download_size: 1931937 dataset_size: 3315228 - config_name: 20231101.dty features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 7030648 num_examples: 3632 download_size: 2521250 dataset_size: 7030648 - config_name: 20231101.dv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 13934393 num_examples: 4352 download_size: 5283133 dataset_size: 13934393 - config_name: 20231101.dz features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 8855969 num_examples: 788 download_size: 2583520 
dataset_size: 8855969 - config_name: 20231101.ee features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 898491 num_examples: 1181 download_size: 492813 dataset_size: 898491 - config_name: 20231101.el features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1345589075 num_examples: 226834 download_size: 637372489 dataset_size: 1345589075 - config_name: 20231101.eml features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3625415 num_examples: 12961 download_size: 1689575 dataset_size: 3625415 - config_name: 20231101.en features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 20200062385 num_examples: 6407814 download_size: 11630929031 dataset_size: 20200062385 - config_name: 20231101.eo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 523113804 num_examples: 344851 download_size: 297738138 dataset_size: 523113804 - config_name: 20231101.es features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6033536133 num_examples: 1841155 download_size: 3493595869 dataset_size: 6033536133 - config_name: 20231101.et features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 440177170 num_examples: 240397 download_size: 265444734 dataset_size: 440177170 - config_name: 20231101.eu features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 565567318 
num_examples: 416347 download_size: 270355505 dataset_size: 565567318 - config_name: 20231101.ext features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4389633 num_examples: 3785 download_size: 2761099 dataset_size: 4389633 - config_name: 20231101.fa features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1899154938 num_examples: 979869 download_size: 759368283 dataset_size: 1899154938 - config_name: 20231101.fat features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2032812 num_examples: 1122 download_size: 1124684 dataset_size: 2032812 - config_name: 20231101.ff features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1867995 num_examples: 2419 download_size: 1087702 dataset_size: 1867995 - config_name: 20231101.fi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1146146663 num_examples: 561598 download_size: 680512230 dataset_size: 1146146663 - config_name: 20231101.fiu-vro features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4636361 num_examples: 6590 download_size: 2434159 dataset_size: 4636361 - config_name: 20231101.fj features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 604791 num_examples: 1294 download_size: 328059 dataset_size: 604791 - config_name: 20231101.fo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train 
num_bytes: 15415249 num_examples: 14080 download_size: 8857239 dataset_size: 15415249 - config_name: 20231101.fon features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 592216 num_examples: 705 download_size: 317444 dataset_size: 592216 - config_name: 20231101.fr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 8065794826 num_examples: 2564646 download_size: 4614488286 dataset_size: 8065794826 - config_name: 20231101.frp features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3676441 num_examples: 5766 download_size: 1914046 dataset_size: 3676441 - config_name: 20231101.frr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 10819914 num_examples: 18666 download_size: 5317694 dataset_size: 10819914 - config_name: 20231101.fur features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4090412 num_examples: 4001 download_size: 2421238 dataset_size: 4090412 - config_name: 20231101.fy features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 134196708 num_examples: 52416 download_size: 76002257 dataset_size: 134196708 - config_name: 20231101.ga features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 60640820 num_examples: 59156 download_size: 34136733 dataset_size: 60640820 - config_name: 20231101.gag features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: 
- name: train num_bytes: 2428849 num_examples: 2968 download_size: 1331866 dataset_size: 2428849 - config_name: 20231101.gan features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2915229 num_examples: 6743 download_size: 1508844 dataset_size: 2915229 - config_name: 20231101.gcr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2338277 num_examples: 2399 download_size: 1345482 dataset_size: 2338277 - config_name: 20231101.gd features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 14051607 num_examples: 15979 download_size: 7190137 dataset_size: 14051607 - config_name: 20231101.gl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 493905881 num_examples: 200092 download_size: 291104907 dataset_size: 493905881 - config_name: 20231101.glk features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6086185 num_examples: 7049 download_size: 2382997 dataset_size: 6086185 - config_name: 20231101.gn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6921948 num_examples: 5519 download_size: 3806548 dataset_size: 6921948 - config_name: 20231101.gom features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 30889533 num_examples: 4259 download_size: 11306217 dataset_size: 30889533 - config_name: 20231101.gor features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string 
splits: - name: train num_bytes: 6369540 num_examples: 15359 download_size: 2101154 dataset_size: 6369540 - config_name: 20231101.got features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1533770 num_examples: 1013 download_size: 636307 dataset_size: 1533770 - config_name: 20231101.gpe features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2017667 num_examples: 1110 download_size: 1141261 dataset_size: 2017667 - config_name: 20231101.gu features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 121282557 num_examples: 30445 download_size: 39554078 dataset_size: 121282557 - config_name: 20231101.guc features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 978923 num_examples: 679 download_size: 578311 dataset_size: 978923 - config_name: 20231101.gur features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2325435 num_examples: 1383 download_size: 1068954 dataset_size: 2325435 - config_name: 20231101.guw features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1913143 num_examples: 1312 download_size: 1042328 dataset_size: 1913143 - config_name: 20231101.gv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6307253 num_examples: 6206 download_size: 3347095 dataset_size: 6307253 - config_name: 20231101.ha features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: 
- name: train num_bytes: 77906472 num_examples: 36492 download_size: 43131815 dataset_size: 77906472 - config_name: 20231101.hak features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4523680 num_examples: 10246 download_size: 1878558 dataset_size: 4523680 - config_name: 20231101.haw features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1677790 num_examples: 2612 download_size: 696781 dataset_size: 1677790 - config_name: 20231101.he features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1950200381 num_examples: 333874 download_size: 979183998 dataset_size: 1950200381 - config_name: 20231101.hi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 672817362 num_examples: 163093 download_size: 237834604 dataset_size: 672817362 - config_name: 20231101.hif features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 5685329 num_examples: 10986 download_size: 2715682 dataset_size: 5685329 - config_name: 20231101.hr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 443636903 num_examples: 202848 download_size: 275245343 dataset_size: 443636903 - config_name: 20231101.hsb features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 15667118 num_examples: 13957 download_size: 7437491 dataset_size: 15667118 - config_name: 20231101.ht features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text 
dtype: string splits: - name: train num_bytes: 55088040 num_examples: 70159 download_size: 21993952 dataset_size: 55088040 - config_name: 20231101.hu features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1515899113 num_examples: 532427 download_size: 904857314 dataset_size: 1515899113 - config_name: 20231101.hy features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1179459973 num_examples: 303036 download_size: 490121120 dataset_size: 1179459973 - config_name: 20231101.hyw features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 59564550 num_examples: 11725 download_size: 27450541 dataset_size: 59564550 - config_name: 20231101.ia features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 16409449 num_examples: 28247 download_size: 8237640 dataset_size: 16409449 - config_name: 20231101.id features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1125928594 num_examples: 665622 download_size: 583801799 dataset_size: 1125928594 - config_name: 20231101.ie features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6737711 num_examples: 11877 download_size: 3019044 dataset_size: 6737711 - config_name: 20231101.ig features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 66086115 num_examples: 22908 download_size: 34663540 dataset_size: 66086115 - config_name: 20231101.ik features: - name: id dtype: string - name: url dtype: string - name: 
title dtype: string - name: text dtype: string splits: - name: train num_bytes: 199773 num_examples: 846 download_size: 115758 dataset_size: 199773 - config_name: 20231101.ilo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 16854494 num_examples: 15371 download_size: 7352572 dataset_size: 16854494 - config_name: 20231101.inh features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2727253 num_examples: 2123 download_size: 1279524 dataset_size: 2727253 - config_name: 20231101.io features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 38735196 num_examples: 40930 download_size: 17106040 dataset_size: 38735196 - config_name: 20231101.is features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 87856729 num_examples: 57453 download_size: 52286137 dataset_size: 87856729 - config_name: 20231101.it features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4924856310 num_examples: 1833639 download_size: 2931265519 dataset_size: 4924856310 - config_name: 20231101.iu features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 291185 num_examples: 562 download_size: 136987 dataset_size: 291185 - config_name: 20231101.ja features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 7039610767 num_examples: 1389467 download_size: 3941998526 dataset_size: 7039610767 - config_name: 20231101.jam features: - name: id dtype: string - name: url dtype: 
string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1142348 num_examples: 1780 download_size: 702664 dataset_size: 1142348 - config_name: 20231101.jbo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2523538 num_examples: 1394 download_size: 890356 dataset_size: 2523538 - config_name: 20231101.jv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 72786688 num_examples: 73380 download_size: 36852134 dataset_size: 72786688 - config_name: 20231101.ka features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 699872960 num_examples: 169602 download_size: 239987665 dataset_size: 699872960 - config_name: 20231101.kaa features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 5139436 num_examples: 4074 download_size: 2913134 dataset_size: 5139436 - config_name: 20231101.kab features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4392542 num_examples: 5830 download_size: 2580584 dataset_size: 4392542 - config_name: 20231101.kbd features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3014575 num_examples: 1670 download_size: 1304580 dataset_size: 3014575 - config_name: 20231101.kbp features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3584563 num_examples: 1931 download_size: 1806400 dataset_size: 3584563 - config_name: 20231101.kcg features: - name: id dtype: string - name: url dtype: 
string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 914665 num_examples: 1151 download_size: 513904 dataset_size: 914665 - config_name: 20231101.kg features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 390163 num_examples: 1329 download_size: 209059 dataset_size: 390163 - config_name: 20231101.ki features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 760980 num_examples: 1668 download_size: 427003 dataset_size: 760980 - config_name: 20231101.kk features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 497917145 num_examples: 238615 download_size: 180750520 dataset_size: 497917145 - config_name: 20231101.kl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 313658 num_examples: 301 download_size: 193719 dataset_size: 313658 - config_name: 20231101.km features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 103252582 num_examples: 11994 download_size: 35567417 dataset_size: 103252582 - config_name: 20231101.kn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 402848197 num_examples: 31437 download_size: 147156434 dataset_size: 402848197 - config_name: 20231101.ko features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1412099944 num_examples: 647897 download_size: 782677061 dataset_size: 1412099944 - config_name: 20231101.koi features: - name: id dtype: string - name: url 
dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 5103799 num_examples: 3504 download_size: 1888392 dataset_size: 5103799 - config_name: 20231101.krc features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4589808 num_examples: 2100 download_size: 2022144 dataset_size: 4589808 - config_name: 20231101.ks features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2868186 num_examples: 4307 download_size: 1094458 dataset_size: 2868186 - config_name: 20231101.ksh features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3117003 num_examples: 2945 download_size: 2009928 dataset_size: 3117003 - config_name: 20231101.ku features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 44523131 num_examples: 63076 download_size: 22938233 dataset_size: 44523131 - config_name: 20231101.kv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 9245577 num_examples: 5595 download_size: 3690978 dataset_size: 9245577 - config_name: 20231101.kw features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4687165 num_examples: 6995 download_size: 2711398 dataset_size: 4687165 - config_name: 20231101.ky features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 166911089 num_examples: 79438 download_size: 63947035 dataset_size: 166911089 - config_name: 20231101.la features: - name: id dtype: string - name: url 
dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 141080163 num_examples: 138263 download_size: 76588430 dataset_size: 141080163 - config_name: 20231101.lad features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4901343 num_examples: 3663 download_size: 2754531 dataset_size: 4901343 - config_name: 20231101.lb features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 88826996 num_examples: 62414 download_size: 50515020 dataset_size: 88826996 - config_name: 20231101.lbe features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 745140 num_examples: 1279 download_size: 304394 dataset_size: 745140 - config_name: 20231101.lez features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 9794637 num_examples: 4264 download_size: 3864848 dataset_size: 9794637 - config_name: 20231101.lfn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 8870685 num_examples: 4832 download_size: 5207546 dataset_size: 8870685 - config_name: 20231101.lg features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6891539 num_examples: 4048 download_size: 3708097 dataset_size: 6891539 - config_name: 20231101.li features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 29633678 num_examples: 14849 download_size: 17727918 dataset_size: 29633678 - config_name: 20231101.lij features: - name: id dtype: string - name: 
url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 11448686 num_examples: 11203 download_size: 6255409 dataset_size: 11448686 - config_name: 20231101.lld features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 50163974 num_examples: 180677 download_size: 13866243 dataset_size: 50163974 - config_name: 20231101.lmo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 43496783 num_examples: 73510 download_size: 19142356 dataset_size: 43496783 - config_name: 20231101.ln features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2035050 num_examples: 3534 download_size: 1122138 dataset_size: 2035050 - config_name: 20231101.lo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 15283258 num_examples: 5014 download_size: 5646554 dataset_size: 15283258 - config_name: 20231101.lt features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 336559824 num_examples: 211292 download_size: 194873569 dataset_size: 336559824 - config_name: 20231101.ltg features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 915364 num_examples: 1070 download_size: 530299 dataset_size: 915364 - config_name: 20231101.lv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 227272112 num_examples: 123413 download_size: 129739227 dataset_size: 227272112 - config_name: 20231101.mad features: - name: id dtype: 
string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1596836 num_examples: 1192 download_size: 908630 dataset_size: 1596836 - config_name: 20231101.mai features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 21562856 num_examples: 14714 download_size: 6180231 dataset_size: 21562856 - config_name: 20231101.map-bms features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 5341068 num_examples: 13580 download_size: 2377123 dataset_size: 5341068 - config_name: 20231101.mdf features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4694770 num_examples: 4257 download_size: 1725294 dataset_size: 4694770 - config_name: 20231101.mg features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 73767229 num_examples: 96316 download_size: 22117304 dataset_size: 73767229 - config_name: 20231101.mhr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 19249450 num_examples: 11347 download_size: 6902162 dataset_size: 19249450 - config_name: 20231101.mi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4169094 num_examples: 7919 download_size: 1044444 dataset_size: 4169094 - config_name: 20231101.min features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 118995918 num_examples: 227143 download_size: 25691303 dataset_size: 118995918 - config_name: 20231101.mk features: - name: 
id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 651422351 num_examples: 139559 download_size: 271265486 dataset_size: 651422351 - config_name: 20231101.ml features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 494135127 num_examples: 85791 download_size: 183071274 dataset_size: 494135127 - config_name: 20231101.mn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 91943210 num_examples: 24048 download_size: 41521786 dataset_size: 91943210 - config_name: 20231101.mni features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 9820483 num_examples: 10894 download_size: 2208525 dataset_size: 9820483 - config_name: 20231101.mnw features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 47237206 num_examples: 3295 download_size: 13765461 dataset_size: 47237206 - config_name: 20231101.mr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 261879018 num_examples: 94133 download_size: 81991233 dataset_size: 261879018 - config_name: 20231101.mrj features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 8732281 num_examples: 10542 download_size: 3283618 dataset_size: 8732281 - config_name: 20231101.ms features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 423352360 num_examples: 368628 download_size: 210149264 dataset_size: 423352360 - config_name: 
20231101.mt features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 32009639 num_examples: 5743 download_size: 18686521 dataset_size: 32009639 - config_name: 20231101.mwl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 19353725 num_examples: 4500 download_size: 11521563 dataset_size: 19353725 - config_name: 20231101.my features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 314417700 num_examples: 109310 download_size: 85497205 dataset_size: 314417700 - config_name: 20231101.myv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 11145865 num_examples: 7958 download_size: 4600620 dataset_size: 11145865 - config_name: 20231101.mzn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 16335757 num_examples: 18717 download_size: 5419390 dataset_size: 16335757 - config_name: 20231101.nah features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2503320 num_examples: 6218 download_size: 1191779 dataset_size: 2503320 - config_name: 20231101.nap features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 6395706 num_examples: 14884 download_size: 3188122 dataset_size: 6395706 - config_name: 20231101.nds features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 92990126 num_examples: 84285 download_size: 48106879 dataset_size: 92990126 - 
config_name: 20231101.nds-nl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 13582403 num_examples: 7847 download_size: 8354427 dataset_size: 13582403 - config_name: 20231101.ne features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 109032486 num_examples: 32885 download_size: 37548833 dataset_size: 109032486 - config_name: 20231101.new features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 159095610 num_examples: 73003 download_size: 20517810 dataset_size: 159095610 - config_name: 20231101.nia features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2117902 num_examples: 1714 download_size: 1086670 dataset_size: 2117902 - config_name: 20231101.nl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2646316266 num_examples: 2135977 download_size: 1436843432 dataset_size: 2646316266 - config_name: 20231101.nn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 237467406 num_examples: 167653 download_size: 134751873 dataset_size: 237467406 - config_name: 20231101.no features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1033188011 num_examples: 617937 download_size: 590970350 dataset_size: 1033188011 - config_name: 20231101.nov features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 965640 num_examples: 1693 
download_size: 493500 dataset_size: 965640 - config_name: 20231101.nqo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 8261058 num_examples: 1580 download_size: 3508645 dataset_size: 8261058 - config_name: 20231101.nrm features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3216817 num_examples: 4902 download_size: 1507257 dataset_size: 3216817 - config_name: 20231101.nso features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2796467 num_examples: 8650 download_size: 936349 dataset_size: 2796467 - config_name: 20231101.nv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 16993060 num_examples: 22460 download_size: 3304031 dataset_size: 16993060 - config_name: 20231101.ny features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1691825 num_examples: 1129 download_size: 938621 dataset_size: 1691825 - config_name: 20231101.oc features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 120092607 num_examples: 89101 download_size: 64043588 dataset_size: 120092607 - config_name: 20231101.olo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3173332 num_examples: 4640 download_size: 1724315 dataset_size: 3173332 - config_name: 20231101.om features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3604768 num_examples: 1970 
download_size: 1982849 dataset_size: 3604768 - config_name: 20231101.or features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 75078226 num_examples: 17375 download_size: 26706212 dataset_size: 75078226 - config_name: 20231101.os features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 13182881 num_examples: 17663 download_size: 5572799 dataset_size: 13182881 - config_name: 20231101.pa features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 212972877 num_examples: 51423 download_size: 81452929 dataset_size: 212972877 - config_name: 20231101.pag features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1391816 num_examples: 2665 download_size: 455808 dataset_size: 1391816 - config_name: 20231101.pam features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 8294902 num_examples: 9006 download_size: 4277038 dataset_size: 8294902 - config_name: 20231101.pap features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4251480 num_examples: 3520 download_size: 2435005 dataset_size: 4251480 - config_name: 20231101.pcd features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 5704321 num_examples: 5717 download_size: 3145572 dataset_size: 5704321 - config_name: 20231101.pcm features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1886987 num_examples: 
1238 download_size: 1160762 dataset_size: 1886987 - config_name: 20231101.pdc features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1225978 num_examples: 2176 download_size: 698254 dataset_size: 1225978 - config_name: 20231101.pfl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3694464 num_examples: 2762 download_size: 1971214 dataset_size: 3694464 - config_name: 20231101.pi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1144100 num_examples: 3057 download_size: 200764 dataset_size: 1144100 - config_name: 20231101.pih features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 278139 num_examples: 934 download_size: 177092 dataset_size: 278139 - config_name: 20231101.pl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2950148809 num_examples: 1587721 download_size: 1765059986 dataset_size: 2950148809 - config_name: 20231101.pms features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 34340217 num_examples: 67980 download_size: 12008880 dataset_size: 34340217 - config_name: 20231101.pnb features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 304117649 num_examples: 72307 download_size: 133266242 dataset_size: 304117649 - config_name: 20231101.pnt features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 630636 
num_examples: 533 download_size: 275639 dataset_size: 630636 - config_name: 20231101.ps features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 114259737 num_examples: 20529 download_size: 53312545 dataset_size: 114259737 - config_name: 20231101.pt features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2758783436 num_examples: 1112246 download_size: 1579641059 dataset_size: 2758783436 - config_name: 20231101.pwn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 811954 num_examples: 408 download_size: 444109 dataset_size: 811954 - config_name: 20231101.qu features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 16828457 num_examples: 24196 download_size: 7688106 dataset_size: 16828457 - config_name: 20231101.rm features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 18053014 num_examples: 3822 download_size: 10483970 dataset_size: 18053014 - config_name: 20231101.rmy features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 611778 num_examples: 1279 download_size: 356457 dataset_size: 611778 - config_name: 20231101.rn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 530318 num_examples: 819 download_size: 301252 dataset_size: 530318 - config_name: 20231101.ro features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 847410736 
num_examples: 442389 download_size: 466937380 dataset_size: 847410736 - config_name: 20231101.roa-rup features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1687829 num_examples: 1432 download_size: 951677 dataset_size: 1687829 - config_name: 20231101.roa-tara features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 7470331 num_examples: 9367 download_size: 4003095 dataset_size: 7470331 - config_name: 20231101.ru features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 10277958919 num_examples: 1945063 download_size: 4876849588 dataset_size: 10277958919 - config_name: 20231101.rue features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 13128572 num_examples: 8759 download_size: 6346106 dataset_size: 13128572 - config_name: 20231101.rw features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 11898854 num_examples: 8063 download_size: 6623388 dataset_size: 11898854 - config_name: 20231101.sa features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 69854997 num_examples: 12156 download_size: 23850161 dataset_size: 69854997 - config_name: 20231101.sah features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 48562374 num_examples: 17098 download_size: 21675888 dataset_size: 48562374 - config_name: 20231101.sat features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - 
name: train num_bytes: 45247783 num_examples: 9767 download_size: 15428584 dataset_size: 45247783 - config_name: 20231101.sc features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 12776438 num_examples: 7586 download_size: 7711996 dataset_size: 12776438 - config_name: 20231101.scn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 17685098 num_examples: 26530 download_size: 10223816 dataset_size: 17685098 - config_name: 20231101.sco features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 42808738 num_examples: 35276 download_size: 24287944 dataset_size: 42808738 - config_name: 20231101.sd features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 37021659 num_examples: 16928 download_size: 17591997 dataset_size: 37021659 - config_name: 20231101.se features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3600527 num_examples: 8043 download_size: 1816006 dataset_size: 3600527 - config_name: 20231101.sg features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 140127 num_examples: 564 download_size: 72486 dataset_size: 140127 - config_name: 20231101.sh features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 569225870 num_examples: 458392 download_size: 266379293 dataset_size: 569225870 - config_name: 20231101.shi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string 
splits: - name: train num_bytes: 2369002 num_examples: 1779 download_size: 1359828 dataset_size: 2369002 - config_name: 20231101.shn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 33553593 num_examples: 13945 download_size: 8163231 dataset_size: 33553593 - config_name: 20231101.si features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 138806443 num_examples: 23065 download_size: 54229127 dataset_size: 138806443 - config_name: 20231101.simple features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 291254232 num_examples: 241787 download_size: 156885218 dataset_size: 291254232 - config_name: 20231101.sk features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 416804817 num_examples: 242235 download_size: 239513292 dataset_size: 416804817 - config_name: 20231101.skr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 22705446 num_examples: 5819 download_size: 9978607 dataset_size: 22705446 - config_name: 20231101.sl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 454829910 num_examples: 183006 download_size: 267485569 dataset_size: 454829910 - config_name: 20231101.sm features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 902927 num_examples: 1151 download_size: 492349 dataset_size: 902927 - config_name: 20231101.smn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - 
name: text dtype: string splits: - name: train num_bytes: 5764244 num_examples: 5383 download_size: 2813872 dataset_size: 5764244 - config_name: 20231101.sn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 9790528 num_examples: 11621 download_size: 4979456 dataset_size: 9790528 - config_name: 20231101.so features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 13663784 num_examples: 9021 download_size: 7940363 dataset_size: 13663784 - config_name: 20231101.sq features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 208779652 num_examples: 104854 download_size: 116945494 dataset_size: 208779652 - config_name: 20231101.sr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1721596392 num_examples: 676605 download_size: 697391786 dataset_size: 1721596392 - config_name: 20231101.srn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 649317 num_examples: 1219 download_size: 215103 dataset_size: 649317 - config_name: 20231101.ss features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1076102 num_examples: 945 download_size: 600997 dataset_size: 1076102 - config_name: 20231101.st features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 968161 num_examples: 1099 download_size: 530165 dataset_size: 968161 - config_name: 20231101.stq features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - 
name: text dtype: string splits: - name: train num_bytes: 4942784 num_examples: 4134 download_size: 2884429 dataset_size: 4942784 - config_name: 20231101.su features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 48066965 num_examples: 61555 download_size: 19806020 dataset_size: 48066965 - config_name: 20231101.sv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2153690744 num_examples: 2574513 download_size: 974261228 dataset_size: 2153690744 - config_name: 20231101.sw features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 73119299 num_examples: 78587 download_size: 35936177 dataset_size: 73119299 - config_name: 20231101.szl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 21439309 num_examples: 57035 download_size: 7347967 dataset_size: 21439309 - config_name: 20231101.szy features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 11355780 num_examples: 4885 download_size: 6192815 dataset_size: 11355780 - config_name: 20231101.ta features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 810734099 num_examples: 160651 download_size: 265652020 dataset_size: 810734099 - config_name: 20231101.tay features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2974229 num_examples: 2747 download_size: 1232811 dataset_size: 2974229 - config_name: 20231101.tcy features: - name: id dtype: string - name: url dtype: string - name: 
title dtype: string - name: text dtype: string splits: - name: train num_bytes: 12166612 num_examples: 2202 download_size: 4611006 dataset_size: 12166612 - config_name: 20231101.te features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 730376585 num_examples: 87854 download_size: 215097076 dataset_size: 730376585 - config_name: 20231101.tet features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1466200 num_examples: 1468 download_size: 744390 dataset_size: 1466200 - config_name: 20231101.tg features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 148256281 num_examples: 110962 download_size: 49825647 dataset_size: 148256281 - config_name: 20231101.th features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1014547923 num_examples: 159719 download_size: 371916105 dataset_size: 1014547923 - config_name: 20231101.ti features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 729995 num_examples: 435 download_size: 363723 dataset_size: 729995 - config_name: 20231101.tk features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 13326412 num_examples: 7918 download_size: 7383654 dataset_size: 13326412 - config_name: 20231101.tl features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 85794472 num_examples: 45341 download_size: 45797527 dataset_size: 85794472 - config_name: 20231101.tly features: - name: id dtype: string - name: url dtype: 
string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2590482 num_examples: 8086 download_size: 1070456 dataset_size: 2590482 - config_name: 20231101.tn features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4380768 num_examples: 1585 download_size: 1708110 dataset_size: 4380768 - config_name: 20231101.to features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1090611 num_examples: 1887 download_size: 518244 dataset_size: 1090611 - config_name: 20231101.tpi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 460420 num_examples: 1399 download_size: 241908 dataset_size: 460420 - config_name: 20231101.tr features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 997254242 num_examples: 534988 download_size: 552923659 dataset_size: 997254242 - config_name: 20231101.trv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4971204 num_examples: 1880 download_size: 2706664 dataset_size: 4971204 - config_name: 20231101.ts features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 847032 num_examples: 785 download_size: 455648 dataset_size: 847032 - config_name: 20231101.tt features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 681325421 num_examples: 501116 download_size: 129141056 dataset_size: 681325421 - config_name: 20231101.tum features: - name: id dtype: string - name: url dtype: 
string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 13429984 num_examples: 18708 download_size: 5459856 dataset_size: 13429984 - config_name: 20231101.tw features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 7982767 num_examples: 3978 download_size: 4118530 dataset_size: 7982767 - config_name: 20231101.ty features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 338743 num_examples: 1355 download_size: 150963 dataset_size: 338743 - config_name: 20231101.tyv features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 14324694 num_examples: 3491 download_size: 6528290 dataset_size: 14324694 - config_name: 20231101.udm features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 7036113 num_examples: 5677 download_size: 2982821 dataset_size: 7036113 - config_name: 20231101.ug features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 42254159 num_examples: 8634 download_size: 17741860 dataset_size: 42254159 - config_name: 20231101.uk features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 4969483901 num_examples: 1294720 download_size: 2276769383 dataset_size: 4969483901 - config_name: 20231101.ur features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 410511855 num_examples: 200154 download_size: 167627869 dataset_size: 410511855 - config_name: 20231101.uz features: - name: id dtype: string - name: 
url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 397176774 num_examples: 246729 download_size: 210262652 dataset_size: 397176774 - config_name: 20231101.ve features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 359542 num_examples: 840 download_size: 163318 dataset_size: 359542 - config_name: 20231101.vec features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 37917528 num_examples: 69268 download_size: 16179506 dataset_size: 37917528 - config_name: 20231101.vep features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 11643856 num_examples: 6960 download_size: 6423002 dataset_size: 11643856 - config_name: 20231101.vi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1617830227 num_examples: 1288680 download_size: 729557588 dataset_size: 1617830227 - config_name: 20231101.vls features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 11336278 num_examples: 7872 download_size: 6985406 dataset_size: 11336278 - config_name: 20231101.vo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 19521708 num_examples: 35193 download_size: 6582571 dataset_size: 19521708 - config_name: 20231101.wa features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 12268826 num_examples: 12038 download_size: 7327616 dataset_size: 12268826 - config_name: 20231101.war features: - name: id 
dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 467647882 num_examples: 1266394 download_size: 104588442 dataset_size: 467647882 - config_name: 20231101.wo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3525303 num_examples: 1746 download_size: 2094574 dataset_size: 3525303 - config_name: 20231101.wuu features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 25029545 num_examples: 43010 download_size: 15985963 dataset_size: 25029545 - config_name: 20231101.xal features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1391731 num_examples: 2295 download_size: 507198 dataset_size: 1391731 - config_name: 20231101.xh features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 3665998 num_examples: 1883 download_size: 2505472 dataset_size: 3665998 - config_name: 20231101.xmf features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 37712629 num_examples: 18099 download_size: 12948576 dataset_size: 37712629 - config_name: 20231101.yi features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 36038273 num_examples: 15179 download_size: 16218296 dataset_size: 36038273 - config_name: 20231101.yo features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 19081408 num_examples: 33819 download_size: 8861465 dataset_size: 19081408 - config_name: 20231101.za features: 
- name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 1365300 num_examples: 2993 download_size: 666521 dataset_size: 1365300 - config_name: 20231101.zea features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 5224563 num_examples: 6082 download_size: 2620396 dataset_size: 5224563 - config_name: 20231101.zh features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 2790577882 num_examples: 1384748 download_size: 1721150260 dataset_size: 2790577882 - config_name: 20231101.zh-classical features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 14869227 num_examples: 12708 download_size: 10098073 dataset_size: 14869227 - config_name: 20231101.zh-min-nan features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 153672031 num_examples: 432798 download_size: 37122048 dataset_size: 153672031 - config_name: 20231101.zh-yue features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 109936351 num_examples: 134140 download_size: 64950815 dataset_size: 109936351 - config_name: 20231101.zu features: - name: id dtype: string - name: url dtype: string - name: title dtype: string - name: text dtype: string splits: - name: train num_bytes: 7088246 num_examples: 11561 download_size: 3792429 dataset_size: 7088246 language_bcp47: - be-tarask - en-simple
---

# Dataset Card for Wikimedia Wikipedia

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://dumps.wikimedia.org](https://dumps.wikimedia.org)
- **Repository:**
- **Paper:**
- **Point of Contact:**

### Dataset Summary

Wikipedia dataset containing cleaned articles in all languages.

The dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/), with one subset per language, each containing a single train split.

Each example contains the content of one full Wikipedia article, cleaned to strip wiki markup and unwanted sections (references, etc.).

All language subsets have already been processed for a recent dump, and you can load them by date and language this way:

```python
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en")
```

#### Data Visualization

Click the [Nomic Atlas](https://atlas.nomic.ai/map/475c26d7-b142-4795-9887-02b6eeb18dc0/0d312be6-a3bb-4586-b6b7-53dcd0cbefa5) map below to visualize the 6.4 million samples in the `20231101.en` subset.
<a href="https://atlas.nomic.ai/map/475c26d7-b142-4795-9887-02b6eeb18dc0/0d312be6-a3bb-4586-b6b7-53dcd0cbefa5">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6480c476cacb1c4a0696eeb8/sZNN6Vubc0Oue83vKaJUu.webp" alt="Nomic-Atlas Wikipedia Map" width="25%"/>
</a>

### Supported Tasks and Leaderboards

The dataset is generally used for language modeling.

### Languages

You can find the list of languages here: https://meta.wikimedia.org/wiki/List_of_Wikipedias

## Dataset Structure

### Data Instances

An example looks as follows:

```
{'id': '1',
 'url': 'https://simple.wikipedia.org/wiki/April',
 'title': 'April',
 'text': 'April is the fourth month...'
}
```

### Data Fields

The data fields are the same across all configurations:

- `id` (`str`): ID of the article.
- `url` (`str`): URL of the article.
- `title` (`str`): Title of the article.
- `text` (`str`): Text content of the article.

### Data Splits

All configurations contain a single `train` split.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

The dataset is built from the Wikipedia dumps: https://dumps.wikimedia.org

You can find the full list of languages and dates here: https://dumps.wikimedia.org/backup-index.html

The articles have been parsed using the [`mwparserfromhell`](https://mwparserfromhell.readthedocs.io) tool.

When uploading the data files for the 20231101 dump, we noticed that the Wikimedia Dumps website does not contain a dump with this date for the "bbc", "dga", or "zgh" Wikipedias. We have reported the issue to the Wikimedia Phabricator: https://phabricator.wikimedia.org/T351761

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?
[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Copyright licensing information: https://dumps.wikimedia.org/legal.html

All original textual content is licensed under the [GNU Free Documentation License](https://www.gnu.org/licenses/fdl-1.3.html) (GFDL) and the [Creative Commons Attribution-Share-Alike 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/). Some text may be available only under the Creative Commons license; see the Wikimedia [Terms of Use](https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use) for details. Text written by some authors may be released under additional licenses or into the public domain.

### Citation Information

```
@ONLINE{wikidump,
    author = "Wikimedia Foundation",
    title  = "Wikimedia Downloads",
    url    = "https://dumps.wikimedia.org"
}
```
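
### Usage Sketch

The subset names in this card follow a `<dump date>.<language code>` pattern (for example `20231101.en`), and every record carries the four string fields listed under Data Fields. The sketch below illustrates both conventions. The helper names `config_name`, `is_valid_record`, and `stream_articles` are hypothetical and not part of the dataset or of the `datasets` library; `stream_articles` additionally assumes that the `datasets` package is installed and that the `wikimedia/wikipedia` repository is reachable over the network.

```python
from typing import Any, Dict, Iterator

# Field names listed under "Data Fields" in this card.
EXPECTED_FIELDS = ("id", "url", "title", "text")


def config_name(dump_date: str, language: str) -> str:
    """Build a subset name such as '20231101.en' (hypothetical helper)."""
    if len(dump_date) != 8 or not dump_date.isdigit():
        raise ValueError(f"dump date must look like YYYYMMDD, got {dump_date!r}")
    if not language:
        raise ValueError("language code must be non-empty")
    return f"{dump_date}.{language}"


def is_valid_record(record: Dict[str, Any]) -> bool:
    """Check that a record has exactly the four string fields from the card."""
    return set(record) == set(EXPECTED_FIELDS) and all(
        isinstance(record[name], str) for name in EXPECTED_FIELDS
    )


def stream_articles(dump_date: str, language: str) -> Iterator[Dict[str, Any]]:
    """Lazily iterate over one subset (assumes `datasets` and network access)."""
    # Imported here so the helpers above stay usable without the library.
    from datasets import load_dataset

    ds = load_dataset(
        "wikimedia/wikipedia",
        config_name(dump_date, language),
        split="train",
        streaming=True,  # avoids downloading the whole subset up front
    )
    yield from ds


# The sample instance shown under "Data Instances".
example = {
    "id": "1",
    "url": "https://simple.wikipedia.org/wiki/April",
    "title": "April",
    "text": "April is the fourth month...",
}
print(config_name("20231101", "en"), is_valid_record(example))
```

Streaming is used in `stream_articles` because some subsets are several gigabytes according to the `download_size` values in the metadata; for small subsets a plain `load_dataset` call as shown in the Dataset Summary may be simpler.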
Provider: sssOrganization