sssOrganization/wikipedia
Source: Hugging Face. Updated 2026-01-05. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/sssOrganization/wikipedia
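The card metadata on this page names one configuration per Wikipedia edition, in the form `20231101.<wiki code>`, each with a single `train` split whose shards match `<config>/train-*`. A minimal sketch of that naming pattern follows, assuming the Hugging Face `datasets` library for the optional loading step. The helpers `config_for` and `load_wiki` are hypothetical names introduced here for illustration, not part of the dataset or of any library. Note that config names use wiki codes such as `zh-classical`, which can differ from the tags in the `language` list (for example `lzh`).

```python
from typing import Tuple

REPO_ID = "sssOrganization/wikipedia"  # repository name taken from the page title
DUMP_DATE = "20231101"                 # dump-date prefix shared by the configs in the card


def config_for(wiki_code: str, dump_date: str = DUMP_DATE) -> Tuple[str, str]:
    """Return (config_name, train shard glob) following the card's naming pattern."""
    config_name = f"{dump_date}.{wiki_code}"
    return config_name, f"{config_name}/train-*"


def load_wiki(wiki_code: str):
    """Hypothetical helper: load the train split of one configuration.

    Assumes the `datasets` library is installed and the repository is reachable.
    """
    # Imported lazily so config_for() works without the library installed.
    from datasets import load_dataset

    config_name, _ = config_for(wiki_code)
    return load_dataset(REPO_ID, config_name, split="train")


name, shards = config_for("zh-classical")
print(name, shards)  # 20231101.zh-classical 20231101.zh-classical/train-*
```

If loading succeeds, each record should carry the four string fields listed under `dataset_info` in the card: `id`, `url`, `title`, and `text`.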
Resource description:
---
language:
- ab
- ace
- ady
- af
- alt
- am
- ami
- an
- ang
- anp
- ar
- arc
- ary
- arz
- as
- ast
- atj
- av
- avk
- awa
- ay
- az
- azb
- ba
- ban
- bar
- bbc
- bcl
- be
- bg
- bh
- bi
- bjn
- blk
- bm
- bn
- bo
- bpy
- br
- bs
- bug
- bxr
- ca
- cbk
- cdo
- ce
- ceb
- ch
- chr
- chy
- ckb
- co
- cr
- crh
- cs
- csb
- cu
- cv
- cy
- da
- dag
- de
- dga
- din
- diq
- dsb
- dty
- dv
- dz
- ee
- el
- eml
- en
- eo
- es
- et
- eu
- ext
- fa
- fat
- ff
- fi
- fj
- fo
- fon
- fr
- frp
- frr
- fur
- fy
- ga
- gag
- gan
- gcr
- gd
- gl
- glk
- gn
- gom
- gor
- got
- gpe
- gsw
- gu
- guc
- gur
- guw
- gv
- ha
- hak
- haw
- hbs
- he
- hi
- hif
- hr
- hsb
- ht
- hu
- hy
- hyw
- ia
- id
- ie
- ig
- ik
- ilo
- inh
- io
- is
- it
- iu
- ja
- jam
- jbo
- jv
- ka
- kaa
- kab
- kbd
- kbp
- kcg
- kg
- ki
- kk
- kl
- km
- kn
- ko
- koi
- krc
- ks
- ksh
- ku
- kv
- kw
- ky
- la
- lad
- lb
- lbe
- lez
- lfn
- lg
- li
- lij
- lld
- lmo
- ln
- lo
- lt
- ltg
- lv
- lzh
- mad
- mai
- map
- mdf
- mg
- mhr
- mi
- min
- mk
- ml
- mn
- mni
- mnw
- mr
- mrj
- ms
- mt
- mwl
- my
- myv
- mzn
- nah
- nan
- nap
- nds
- ne
- new
- nia
- nl
- nn
- 'no'
- nov
- nqo
- nrf
- nso
- nv
- ny
- oc
- olo
- om
- or
- os
- pa
- pag
- pam
- pap
- pcd
- pcm
- pdc
- pfl
- pi
- pih
- pl
- pms
- pnb
- pnt
- ps
- pt
- pwn
- qu
- rm
- rmy
- rn
- ro
- ru
- rue
- rup
- rw
- sa
- sah
- sat
- sc
- scn
- sco
- sd
- se
- sg
- sgs
- shi
- shn
- si
- sk
- skr
- sl
- sm
- smn
- sn
- so
- sq
- sr
- srn
- ss
- st
- stq
- su
- sv
- sw
- szl
- szy
- ta
- tay
- tcy
- te
- tet
- tg
- th
- ti
- tk
- tl
- tly
- tn
- to
- tpi
- tr
- trv
- ts
- tt
- tum
- tw
- ty
- tyv
- udm
- ug
- uk
- ur
- uz
- ve
- vec
- vep
- vi
- vls
- vo
- vro
- wa
- war
- wo
- wuu
- xal
- xh
- xmf
- yi
- yo
- yue
- za
- zea
- zgh
- zh
- zu
license:
- cc-by-sa-3.0
- gfdl
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
configs:
- config_name: 20231101.ab
data_files:
- split: train
path: 20231101.ab/train-*
- config_name: 20231101.ace
data_files:
- split: train
path: 20231101.ace/train-*
- config_name: 20231101.ady
data_files:
- split: train
path: 20231101.ady/train-*
- config_name: 20231101.af
data_files:
- split: train
path: 20231101.af/train-*
- config_name: 20231101.als
data_files:
- split: train
path: 20231101.als/train-*
- config_name: 20231101.alt
data_files:
- split: train
path: 20231101.alt/train-*
- config_name: 20231101.am
data_files:
- split: train
path: 20231101.am/train-*
- config_name: 20231101.ami
data_files:
- split: train
path: 20231101.ami/train-*
- config_name: 20231101.an
data_files:
- split: train
path: 20231101.an/train-*
- config_name: 20231101.ang
data_files:
- split: train
path: 20231101.ang/train-*
- config_name: 20231101.anp
data_files:
- split: train
path: 20231101.anp/train-*
- config_name: 20231101.ar
data_files:
- split: train
path: 20231101.ar/train-*
- config_name: 20231101.arc
data_files:
- split: train
path: 20231101.arc/train-*
- config_name: 20231101.ary
data_files:
- split: train
path: 20231101.ary/train-*
- config_name: 20231101.arz
data_files:
- split: train
path: 20231101.arz/train-*
- config_name: 20231101.as
data_files:
- split: train
path: 20231101.as/train-*
- config_name: 20231101.ast
data_files:
- split: train
path: 20231101.ast/train-*
- config_name: 20231101.atj
data_files:
- split: train
path: 20231101.atj/train-*
- config_name: 20231101.av
data_files:
- split: train
path: 20231101.av/train-*
- config_name: 20231101.avk
data_files:
- split: train
path: 20231101.avk/train-*
- config_name: 20231101.awa
data_files:
- split: train
path: 20231101.awa/train-*
- config_name: 20231101.ay
data_files:
- split: train
path: 20231101.ay/train-*
- config_name: 20231101.az
data_files:
- split: train
path: 20231101.az/train-*
- config_name: 20231101.azb
data_files:
- split: train
path: 20231101.azb/train-*
- config_name: 20231101.ba
data_files:
- split: train
path: 20231101.ba/train-*
- config_name: 20231101.ban
data_files:
- split: train
path: 20231101.ban/train-*
- config_name: 20231101.bar
data_files:
- split: train
path: 20231101.bar/train-*
- config_name: 20231101.bat-smg
data_files:
- split: train
path: 20231101.bat-smg/train-*
- config_name: 20231101.bcl
data_files:
- split: train
path: 20231101.bcl/train-*
- config_name: 20231101.be
data_files:
- split: train
path: 20231101.be/train-*
- config_name: 20231101.be-x-old
data_files:
- split: train
path: 20231101.be-x-old/train-*
- config_name: 20231101.bg
data_files:
- split: train
path: 20231101.bg/train-*
- config_name: 20231101.bh
data_files:
- split: train
path: 20231101.bh/train-*
- config_name: 20231101.bi
data_files:
- split: train
path: 20231101.bi/train-*
- config_name: 20231101.bjn
data_files:
- split: train
path: 20231101.bjn/train-*
- config_name: 20231101.blk
data_files:
- split: train
path: 20231101.blk/train-*
- config_name: 20231101.bm
data_files:
- split: train
path: 20231101.bm/train-*
- config_name: 20231101.bn
data_files:
- split: train
path: 20231101.bn/train-*
- config_name: 20231101.bo
data_files:
- split: train
path: 20231101.bo/train-*
- config_name: 20231101.bpy
data_files:
- split: train
path: 20231101.bpy/train-*
- config_name: 20231101.br
data_files:
- split: train
path: 20231101.br/train-*
- config_name: 20231101.bs
data_files:
- split: train
path: 20231101.bs/train-*
- config_name: 20231101.bug
data_files:
- split: train
path: 20231101.bug/train-*
- config_name: 20231101.bxr
data_files:
- split: train
path: 20231101.bxr/train-*
- config_name: 20231101.ca
data_files:
- split: train
path: 20231101.ca/train-*
- config_name: 20231101.cbk-zam
data_files:
- split: train
path: 20231101.cbk-zam/train-*
- config_name: 20231101.cdo
data_files:
- split: train
path: 20231101.cdo/train-*
- config_name: 20231101.ce
data_files:
- split: train
path: 20231101.ce/train-*
- config_name: 20231101.ceb
data_files:
- split: train
path: 20231101.ceb/train-*
- config_name: 20231101.ch
data_files:
- split: train
path: 20231101.ch/train-*
- config_name: 20231101.chr
data_files:
- split: train
path: 20231101.chr/train-*
- config_name: 20231101.chy
data_files:
- split: train
path: 20231101.chy/train-*
- config_name: 20231101.ckb
data_files:
- split: train
path: 20231101.ckb/train-*
- config_name: 20231101.co
data_files:
- split: train
path: 20231101.co/train-*
- config_name: 20231101.cr
data_files:
- split: train
path: 20231101.cr/train-*
- config_name: 20231101.crh
data_files:
- split: train
path: 20231101.crh/train-*
- config_name: 20231101.cs
data_files:
- split: train
path: 20231101.cs/train-*
- config_name: 20231101.csb
data_files:
- split: train
path: 20231101.csb/train-*
- config_name: 20231101.cu
data_files:
- split: train
path: 20231101.cu/train-*
- config_name: 20231101.cv
data_files:
- split: train
path: 20231101.cv/train-*
- config_name: 20231101.cy
data_files:
- split: train
path: 20231101.cy/train-*
- config_name: 20231101.da
data_files:
- split: train
path: 20231101.da/train-*
- config_name: 20231101.dag
data_files:
- split: train
path: 20231101.dag/train-*
- config_name: 20231101.de
data_files:
- split: train
path: 20231101.de/train-*
- config_name: 20231101.din
data_files:
- split: train
path: 20231101.din/train-*
- config_name: 20231101.diq
data_files:
- split: train
path: 20231101.diq/train-*
- config_name: 20231101.dsb
data_files:
- split: train
path: 20231101.dsb/train-*
- config_name: 20231101.dty
data_files:
- split: train
path: 20231101.dty/train-*
- config_name: 20231101.dv
data_files:
- split: train
path: 20231101.dv/train-*
- config_name: 20231101.dz
data_files:
- split: train
path: 20231101.dz/train-*
- config_name: 20231101.ee
data_files:
- split: train
path: 20231101.ee/train-*
- config_name: 20231101.el
data_files:
- split: train
path: 20231101.el/train-*
- config_name: 20231101.eml
data_files:
- split: train
path: 20231101.eml/train-*
- config_name: 20231101.en
data_files:
- split: train
path: 20231101.en/train-*
- config_name: 20231101.eo
data_files:
- split: train
path: 20231101.eo/train-*
- config_name: 20231101.es
data_files:
- split: train
path: 20231101.es/train-*
- config_name: 20231101.et
data_files:
- split: train
path: 20231101.et/train-*
- config_name: 20231101.eu
data_files:
- split: train
path: 20231101.eu/train-*
- config_name: 20231101.ext
data_files:
- split: train
path: 20231101.ext/train-*
- config_name: 20231101.fa
data_files:
- split: train
path: 20231101.fa/train-*
- config_name: 20231101.fat
data_files:
- split: train
path: 20231101.fat/train-*
- config_name: 20231101.ff
data_files:
- split: train
path: 20231101.ff/train-*
- config_name: 20231101.fi
data_files:
- split: train
path: 20231101.fi/train-*
- config_name: 20231101.fiu-vro
data_files:
- split: train
path: 20231101.fiu-vro/train-*
- config_name: 20231101.fj
data_files:
- split: train
path: 20231101.fj/train-*
- config_name: 20231101.fo
data_files:
- split: train
path: 20231101.fo/train-*
- config_name: 20231101.fon
data_files:
- split: train
path: 20231101.fon/train-*
- config_name: 20231101.fr
data_files:
- split: train
path: 20231101.fr/train-*
- config_name: 20231101.frp
data_files:
- split: train
path: 20231101.frp/train-*
- config_name: 20231101.frr
data_files:
- split: train
path: 20231101.frr/train-*
- config_name: 20231101.fur
data_files:
- split: train
path: 20231101.fur/train-*
- config_name: 20231101.fy
data_files:
- split: train
path: 20231101.fy/train-*
- config_name: 20231101.ga
data_files:
- split: train
path: 20231101.ga/train-*
- config_name: 20231101.gag
data_files:
- split: train
path: 20231101.gag/train-*
- config_name: 20231101.gan
data_files:
- split: train
path: 20231101.gan/train-*
- config_name: 20231101.gcr
data_files:
- split: train
path: 20231101.gcr/train-*
- config_name: 20231101.gd
data_files:
- split: train
path: 20231101.gd/train-*
- config_name: 20231101.gl
data_files:
- split: train
path: 20231101.gl/train-*
- config_name: 20231101.glk
data_files:
- split: train
path: 20231101.glk/train-*
- config_name: 20231101.gn
data_files:
- split: train
path: 20231101.gn/train-*
- config_name: 20231101.gom
data_files:
- split: train
path: 20231101.gom/train-*
- config_name: 20231101.gor
data_files:
- split: train
path: 20231101.gor/train-*
- config_name: 20231101.got
data_files:
- split: train
path: 20231101.got/train-*
- config_name: 20231101.gpe
data_files:
- split: train
path: 20231101.gpe/train-*
- config_name: 20231101.gu
data_files:
- split: train
path: 20231101.gu/train-*
- config_name: 20231101.guc
data_files:
- split: train
path: 20231101.guc/train-*
- config_name: 20231101.gur
data_files:
- split: train
path: 20231101.gur/train-*
- config_name: 20231101.guw
data_files:
- split: train
path: 20231101.guw/train-*
- config_name: 20231101.gv
data_files:
- split: train
path: 20231101.gv/train-*
- config_name: 20231101.ha
data_files:
- split: train
path: 20231101.ha/train-*
- config_name: 20231101.hak
data_files:
- split: train
path: 20231101.hak/train-*
- config_name: 20231101.haw
data_files:
- split: train
path: 20231101.haw/train-*
- config_name: 20231101.he
data_files:
- split: train
path: 20231101.he/train-*
- config_name: 20231101.hi
data_files:
- split: train
path: 20231101.hi/train-*
- config_name: 20231101.hif
data_files:
- split: train
path: 20231101.hif/train-*
- config_name: 20231101.hr
data_files:
- split: train
path: 20231101.hr/train-*
- config_name: 20231101.hsb
data_files:
- split: train
path: 20231101.hsb/train-*
- config_name: 20231101.ht
data_files:
- split: train
path: 20231101.ht/train-*
- config_name: 20231101.hu
data_files:
- split: train
path: 20231101.hu/train-*
- config_name: 20231101.hy
data_files:
- split: train
path: 20231101.hy/train-*
- config_name: 20231101.hyw
data_files:
- split: train
path: 20231101.hyw/train-*
- config_name: 20231101.ia
data_files:
- split: train
path: 20231101.ia/train-*
- config_name: 20231101.id
data_files:
- split: train
path: 20231101.id/train-*
- config_name: 20231101.ie
data_files:
- split: train
path: 20231101.ie/train-*
- config_name: 20231101.ig
data_files:
- split: train
path: 20231101.ig/train-*
- config_name: 20231101.ik
data_files:
- split: train
path: 20231101.ik/train-*
- config_name: 20231101.ilo
data_files:
- split: train
path: 20231101.ilo/train-*
- config_name: 20231101.inh
data_files:
- split: train
path: 20231101.inh/train-*
- config_name: 20231101.io
data_files:
- split: train
path: 20231101.io/train-*
- config_name: 20231101.is
data_files:
- split: train
path: 20231101.is/train-*
- config_name: 20231101.it
data_files:
- split: train
path: 20231101.it/train-*
- config_name: 20231101.iu
data_files:
- split: train
path: 20231101.iu/train-*
- config_name: 20231101.ja
data_files:
- split: train
path: 20231101.ja/train-*
- config_name: 20231101.jam
data_files:
- split: train
path: 20231101.jam/train-*
- config_name: 20231101.jbo
data_files:
- split: train
path: 20231101.jbo/train-*
- config_name: 20231101.jv
data_files:
- split: train
path: 20231101.jv/train-*
- config_name: 20231101.ka
data_files:
- split: train
path: 20231101.ka/train-*
- config_name: 20231101.kaa
data_files:
- split: train
path: 20231101.kaa/train-*
- config_name: 20231101.kab
data_files:
- split: train
path: 20231101.kab/train-*
- config_name: 20231101.kbd
data_files:
- split: train
path: 20231101.kbd/train-*
- config_name: 20231101.kbp
data_files:
- split: train
path: 20231101.kbp/train-*
- config_name: 20231101.kcg
data_files:
- split: train
path: 20231101.kcg/train-*
- config_name: 20231101.kg
data_files:
- split: train
path: 20231101.kg/train-*
- config_name: 20231101.ki
data_files:
- split: train
path: 20231101.ki/train-*
- config_name: 20231101.kk
data_files:
- split: train
path: 20231101.kk/train-*
- config_name: 20231101.kl
data_files:
- split: train
path: 20231101.kl/train-*
- config_name: 20231101.km
data_files:
- split: train
path: 20231101.km/train-*
- config_name: 20231101.kn
data_files:
- split: train
path: 20231101.kn/train-*
- config_name: 20231101.ko
data_files:
- split: train
path: 20231101.ko/train-*
- config_name: 20231101.koi
data_files:
- split: train
path: 20231101.koi/train-*
- config_name: 20231101.krc
data_files:
- split: train
path: 20231101.krc/train-*
- config_name: 20231101.ks
data_files:
- split: train
path: 20231101.ks/train-*
- config_name: 20231101.ksh
data_files:
- split: train
path: 20231101.ksh/train-*
- config_name: 20231101.ku
data_files:
- split: train
path: 20231101.ku/train-*
- config_name: 20231101.kv
data_files:
- split: train
path: 20231101.kv/train-*
- config_name: 20231101.kw
data_files:
- split: train
path: 20231101.kw/train-*
- config_name: 20231101.ky
data_files:
- split: train
path: 20231101.ky/train-*
- config_name: 20231101.la
data_files:
- split: train
path: 20231101.la/train-*
- config_name: 20231101.lad
data_files:
- split: train
path: 20231101.lad/train-*
- config_name: 20231101.lb
data_files:
- split: train
path: 20231101.lb/train-*
- config_name: 20231101.lbe
data_files:
- split: train
path: 20231101.lbe/train-*
- config_name: 20231101.lez
data_files:
- split: train
path: 20231101.lez/train-*
- config_name: 20231101.lfn
data_files:
- split: train
path: 20231101.lfn/train-*
- config_name: 20231101.lg
data_files:
- split: train
path: 20231101.lg/train-*
- config_name: 20231101.li
data_files:
- split: train
path: 20231101.li/train-*
- config_name: 20231101.lij
data_files:
- split: train
path: 20231101.lij/train-*
- config_name: 20231101.lld
data_files:
- split: train
path: 20231101.lld/train-*
- config_name: 20231101.lmo
data_files:
- split: train
path: 20231101.lmo/train-*
- config_name: 20231101.ln
data_files:
- split: train
path: 20231101.ln/train-*
- config_name: 20231101.lo
data_files:
- split: train
path: 20231101.lo/train-*
- config_name: 20231101.lt
data_files:
- split: train
path: 20231101.lt/train-*
- config_name: 20231101.ltg
data_files:
- split: train
path: 20231101.ltg/train-*
- config_name: 20231101.lv
data_files:
- split: train
path: 20231101.lv/train-*
- config_name: 20231101.mad
data_files:
- split: train
path: 20231101.mad/train-*
- config_name: 20231101.mai
data_files:
- split: train
path: 20231101.mai/train-*
- config_name: 20231101.map-bms
data_files:
- split: train
path: 20231101.map-bms/train-*
- config_name: 20231101.mdf
data_files:
- split: train
path: 20231101.mdf/train-*
- config_name: 20231101.mg
data_files:
- split: train
path: 20231101.mg/train-*
- config_name: 20231101.mhr
data_files:
- split: train
path: 20231101.mhr/train-*
- config_name: 20231101.mi
data_files:
- split: train
path: 20231101.mi/train-*
- config_name: 20231101.min
data_files:
- split: train
path: 20231101.min/train-*
- config_name: 20231101.mk
data_files:
- split: train
path: 20231101.mk/train-*
- config_name: 20231101.ml
data_files:
- split: train
path: 20231101.ml/train-*
- config_name: 20231101.mn
data_files:
- split: train
path: 20231101.mn/train-*
- config_name: 20231101.mni
data_files:
- split: train
path: 20231101.mni/train-*
- config_name: 20231101.mnw
data_files:
- split: train
path: 20231101.mnw/train-*
- config_name: 20231101.mr
data_files:
- split: train
path: 20231101.mr/train-*
- config_name: 20231101.mrj
data_files:
- split: train
path: 20231101.mrj/train-*
- config_name: 20231101.ms
data_files:
- split: train
path: 20231101.ms/train-*
- config_name: 20231101.mt
data_files:
- split: train
path: 20231101.mt/train-*
- config_name: 20231101.mwl
data_files:
- split: train
path: 20231101.mwl/train-*
- config_name: 20231101.my
data_files:
- split: train
path: 20231101.my/train-*
- config_name: 20231101.myv
data_files:
- split: train
path: 20231101.myv/train-*
- config_name: 20231101.mzn
data_files:
- split: train
path: 20231101.mzn/train-*
- config_name: 20231101.nah
data_files:
- split: train
path: 20231101.nah/train-*
- config_name: 20231101.nap
data_files:
- split: train
path: 20231101.nap/train-*
- config_name: 20231101.nds
data_files:
- split: train
path: 20231101.nds/train-*
- config_name: 20231101.nds-nl
data_files:
- split: train
path: 20231101.nds-nl/train-*
- config_name: 20231101.ne
data_files:
- split: train
path: 20231101.ne/train-*
- config_name: 20231101.new
data_files:
- split: train
path: 20231101.new/train-*
- config_name: 20231101.nia
data_files:
- split: train
path: 20231101.nia/train-*
- config_name: 20231101.nl
data_files:
- split: train
path: 20231101.nl/train-*
- config_name: 20231101.nn
data_files:
- split: train
path: 20231101.nn/train-*
- config_name: 20231101.no
data_files:
- split: train
path: 20231101.no/train-*
- config_name: 20231101.nov
data_files:
- split: train
path: 20231101.nov/train-*
- config_name: 20231101.nqo
data_files:
- split: train
path: 20231101.nqo/train-*
- config_name: 20231101.nrm
data_files:
- split: train
path: 20231101.nrm/train-*
- config_name: 20231101.nso
data_files:
- split: train
path: 20231101.nso/train-*
- config_name: 20231101.nv
data_files:
- split: train
path: 20231101.nv/train-*
- config_name: 20231101.ny
data_files:
- split: train
path: 20231101.ny/train-*
- config_name: 20231101.oc
data_files:
- split: train
path: 20231101.oc/train-*
- config_name: 20231101.olo
data_files:
- split: train
path: 20231101.olo/train-*
- config_name: 20231101.om
data_files:
- split: train
path: 20231101.om/train-*
- config_name: 20231101.or
data_files:
- split: train
path: 20231101.or/train-*
- config_name: 20231101.os
data_files:
- split: train
path: 20231101.os/train-*
- config_name: 20231101.pa
data_files:
- split: train
path: 20231101.pa/train-*
- config_name: 20231101.pag
data_files:
- split: train
path: 20231101.pag/train-*
- config_name: 20231101.pam
data_files:
- split: train
path: 20231101.pam/train-*
- config_name: 20231101.pap
data_files:
- split: train
path: 20231101.pap/train-*
- config_name: 20231101.pcd
data_files:
- split: train
path: 20231101.pcd/train-*
- config_name: 20231101.pcm
data_files:
- split: train
path: 20231101.pcm/train-*
- config_name: 20231101.pdc
data_files:
- split: train
path: 20231101.pdc/train-*
- config_name: 20231101.pfl
data_files:
- split: train
path: 20231101.pfl/train-*
- config_name: 20231101.pi
data_files:
- split: train
path: 20231101.pi/train-*
- config_name: 20231101.pih
data_files:
- split: train
path: 20231101.pih/train-*
- config_name: 20231101.pl
data_files:
- split: train
path: 20231101.pl/train-*
- config_name: 20231101.pms
data_files:
- split: train
path: 20231101.pms/train-*
- config_name: 20231101.pnb
data_files:
- split: train
path: 20231101.pnb/train-*
- config_name: 20231101.pnt
data_files:
- split: train
path: 20231101.pnt/train-*
- config_name: 20231101.ps
data_files:
- split: train
path: 20231101.ps/train-*
- config_name: 20231101.pt
data_files:
- split: train
path: 20231101.pt/train-*
- config_name: 20231101.pwn
data_files:
- split: train
path: 20231101.pwn/train-*
- config_name: 20231101.qu
data_files:
- split: train
path: 20231101.qu/train-*
- config_name: 20231101.rm
data_files:
- split: train
path: 20231101.rm/train-*
- config_name: 20231101.rmy
data_files:
- split: train
path: 20231101.rmy/train-*
- config_name: 20231101.rn
data_files:
- split: train
path: 20231101.rn/train-*
- config_name: 20231101.ro
data_files:
- split: train
path: 20231101.ro/train-*
- config_name: 20231101.roa-rup
data_files:
- split: train
path: 20231101.roa-rup/train-*
- config_name: 20231101.roa-tara
data_files:
- split: train
path: 20231101.roa-tara/train-*
- config_name: 20231101.ru
data_files:
- split: train
path: 20231101.ru/train-*
- config_name: 20231101.rue
data_files:
- split: train
path: 20231101.rue/train-*
- config_name: 20231101.rw
data_files:
- split: train
path: 20231101.rw/train-*
- config_name: 20231101.sa
data_files:
- split: train
path: 20231101.sa/train-*
- config_name: 20231101.sah
data_files:
- split: train
path: 20231101.sah/train-*
- config_name: 20231101.sat
data_files:
- split: train
path: 20231101.sat/train-*
- config_name: 20231101.sc
data_files:
- split: train
path: 20231101.sc/train-*
- config_name: 20231101.scn
data_files:
- split: train
path: 20231101.scn/train-*
- config_name: 20231101.sco
data_files:
- split: train
path: 20231101.sco/train-*
- config_name: 20231101.sd
data_files:
- split: train
path: 20231101.sd/train-*
- config_name: 20231101.se
data_files:
- split: train
path: 20231101.se/train-*
- config_name: 20231101.sg
data_files:
- split: train
path: 20231101.sg/train-*
- config_name: 20231101.sh
data_files:
- split: train
path: 20231101.sh/train-*
- config_name: 20231101.shi
data_files:
- split: train
path: 20231101.shi/train-*
- config_name: 20231101.shn
data_files:
- split: train
path: 20231101.shn/train-*
- config_name: 20231101.si
data_files:
- split: train
path: 20231101.si/train-*
- config_name: 20231101.simple
data_files:
- split: train
path: 20231101.simple/train-*
- config_name: 20231101.sk
data_files:
- split: train
path: 20231101.sk/train-*
- config_name: 20231101.skr
data_files:
- split: train
path: 20231101.skr/train-*
- config_name: 20231101.sl
data_files:
- split: train
path: 20231101.sl/train-*
- config_name: 20231101.sm
data_files:
- split: train
path: 20231101.sm/train-*
- config_name: 20231101.smn
data_files:
- split: train
path: 20231101.smn/train-*
- config_name: 20231101.sn
data_files:
- split: train
path: 20231101.sn/train-*
- config_name: 20231101.so
data_files:
- split: train
path: 20231101.so/train-*
- config_name: 20231101.sq
data_files:
- split: train
path: 20231101.sq/train-*
- config_name: 20231101.sr
data_files:
- split: train
path: 20231101.sr/train-*
- config_name: 20231101.srn
data_files:
- split: train
path: 20231101.srn/train-*
- config_name: 20231101.ss
data_files:
- split: train
path: 20231101.ss/train-*
- config_name: 20231101.st
data_files:
- split: train
path: 20231101.st/train-*
- config_name: 20231101.stq
data_files:
- split: train
path: 20231101.stq/train-*
- config_name: 20231101.su
data_files:
- split: train
path: 20231101.su/train-*
- config_name: 20231101.sv
data_files:
- split: train
path: 20231101.sv/train-*
- config_name: 20231101.sw
data_files:
- split: train
path: 20231101.sw/train-*
- config_name: 20231101.szl
data_files:
- split: train
path: 20231101.szl/train-*
- config_name: 20231101.szy
data_files:
- split: train
path: 20231101.szy/train-*
- config_name: 20231101.ta
data_files:
- split: train
path: 20231101.ta/train-*
- config_name: 20231101.tay
data_files:
- split: train
path: 20231101.tay/train-*
- config_name: 20231101.tcy
data_files:
- split: train
path: 20231101.tcy/train-*
- config_name: 20231101.te
data_files:
- split: train
path: 20231101.te/train-*
- config_name: 20231101.tet
data_files:
- split: train
path: 20231101.tet/train-*
- config_name: 20231101.tg
data_files:
- split: train
path: 20231101.tg/train-*
- config_name: 20231101.th
data_files:
- split: train
path: 20231101.th/train-*
- config_name: 20231101.ti
data_files:
- split: train
path: 20231101.ti/train-*
- config_name: 20231101.tk
data_files:
- split: train
path: 20231101.tk/train-*
- config_name: 20231101.tl
data_files:
- split: train
path: 20231101.tl/train-*
- config_name: 20231101.tly
data_files:
- split: train
path: 20231101.tly/train-*
- config_name: 20231101.tn
data_files:
- split: train
path: 20231101.tn/train-*
- config_name: 20231101.to
data_files:
- split: train
path: 20231101.to/train-*
- config_name: 20231101.tpi
data_files:
- split: train
path: 20231101.tpi/train-*
- config_name: 20231101.tr
data_files:
- split: train
path: 20231101.tr/train-*
- config_name: 20231101.trv
data_files:
- split: train
path: 20231101.trv/train-*
- config_name: 20231101.ts
data_files:
- split: train
path: 20231101.ts/train-*
- config_name: 20231101.tt
data_files:
- split: train
path: 20231101.tt/train-*
- config_name: 20231101.tum
data_files:
- split: train
path: 20231101.tum/train-*
- config_name: 20231101.tw
data_files:
- split: train
path: 20231101.tw/train-*
- config_name: 20231101.ty
data_files:
- split: train
path: 20231101.ty/train-*
- config_name: 20231101.tyv
data_files:
- split: train
path: 20231101.tyv/train-*
- config_name: 20231101.udm
data_files:
- split: train
path: 20231101.udm/train-*
- config_name: 20231101.ug
data_files:
- split: train
path: 20231101.ug/train-*
- config_name: 20231101.uk
data_files:
- split: train
path: 20231101.uk/train-*
- config_name: 20231101.ur
data_files:
- split: train
path: 20231101.ur/train-*
- config_name: 20231101.uz
data_files:
- split: train
path: 20231101.uz/train-*
- config_name: 20231101.ve
data_files:
- split: train
path: 20231101.ve/train-*
- config_name: 20231101.vec
data_files:
- split: train
path: 20231101.vec/train-*
- config_name: 20231101.vep
data_files:
- split: train
path: 20231101.vep/train-*
- config_name: 20231101.vi
data_files:
- split: train
path: 20231101.vi/train-*
- config_name: 20231101.vls
data_files:
- split: train
path: 20231101.vls/train-*
- config_name: 20231101.vo
data_files:
- split: train
path: 20231101.vo/train-*
- config_name: 20231101.wa
data_files:
- split: train
path: 20231101.wa/train-*
- config_name: 20231101.war
data_files:
- split: train
path: 20231101.war/train-*
- config_name: 20231101.wo
data_files:
- split: train
path: 20231101.wo/train-*
- config_name: 20231101.wuu
data_files:
- split: train
path: 20231101.wuu/train-*
- config_name: 20231101.xal
data_files:
- split: train
path: 20231101.xal/train-*
- config_name: 20231101.xh
data_files:
- split: train
path: 20231101.xh/train-*
- config_name: 20231101.xmf
data_files:
- split: train
path: 20231101.xmf/train-*
- config_name: 20231101.yi
data_files:
- split: train
path: 20231101.yi/train-*
- config_name: 20231101.yo
data_files:
- split: train
path: 20231101.yo/train-*
- config_name: 20231101.za
data_files:
- split: train
path: 20231101.za/train-*
- config_name: 20231101.zea
data_files:
- split: train
path: 20231101.zea/train-*
- config_name: 20231101.zh
data_files:
- split: train
path: 20231101.zh/train-*
- config_name: 20231101.zh-classical
data_files:
- split: train
path: 20231101.zh-classical/train-*
- config_name: 20231101.zh-min-nan
data_files:
- split: train
path: 20231101.zh-min-nan/train-*
- config_name: 20231101.zh-yue
data_files:
- split: train
path: 20231101.zh-yue/train-*
- config_name: 20231101.zu
data_files:
- split: train
path: 20231101.zu/train-*
dataset_info:
- config_name: 20231101.ab
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4334455
num_examples: 6152
download_size: 1237796
dataset_size: 4334455
- config_name: 20231101.ace
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5065801
num_examples: 13003
download_size: 1574258
dataset_size: 5065801
- config_name: 20231101.ady
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 765030
num_examples: 706
download_size: 347450
dataset_size: 765030
- config_name: 20231101.af
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 226672176
num_examples: 112518
download_size: 124485544
dataset_size: 226672176
- config_name: 20231101.als
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 81450196
num_examples: 30013
download_size: 49452211
dataset_size: 81450196
- config_name: 20231101.alt
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6819963
num_examples: 1087
download_size: 2910477
dataset_size: 6819963
- config_name: 20231101.am
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 24218002
num_examples: 13906
download_size: 10720027
dataset_size: 24218002
- config_name: 20231101.ami
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4460174
num_examples: 1628
download_size: 2261859
dataset_size: 4460174
- config_name: 20231101.an
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 57572050
num_examples: 44249
download_size: 29573020
dataset_size: 57572050
- config_name: 20231101.ang
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2913906
num_examples: 4121
download_size: 1789811
dataset_size: 2913906
- config_name: 20231101.anp
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 9226211
num_examples: 2749
download_size: 3355979
dataset_size: 9226211
- config_name: 20231101.ar
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3124486159
num_examples: 1219201
download_size: 1323304271
dataset_size: 3124486159
- config_name: 20231101.arc
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 849731
num_examples: 1936
download_size: 369584
dataset_size: 849731
- config_name: 20231101.ary
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 12049878
num_examples: 8087
download_size: 4672257
dataset_size: 12049878
- config_name: 20231101.arz
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1402294447
num_examples: 1620194
download_size: 317231585
dataset_size: 1402294447
- config_name: 20231101.as
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 90312333
num_examples: 12338
download_size: 34581561
dataset_size: 90312333
- config_name: 20231101.ast
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 470575521
num_examples: 133419
download_size: 271196430
dataset_size: 470575521
- config_name: 20231101.atj
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1012467
num_examples: 1971
download_size: 513962
dataset_size: 1012467
- config_name: 20231101.av
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6084045
num_examples: 3426
download_size: 2573436
dataset_size: 6084045
- config_name: 20231101.avk
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 32119428
num_examples: 28353
download_size: 7984474
dataset_size: 32119428
- config_name: 20231101.awa
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3703396
num_examples: 3679
download_size: 1269824
dataset_size: 3703396
- config_name: 20231101.ay
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4395813
num_examples: 5384
download_size: 1756131
dataset_size: 4395813
- config_name: 20231101.az
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 433663157
num_examples: 196158
download_size: 230064038
dataset_size: 433663157
- config_name: 20231101.azb
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 187041147
num_examples: 243376
download_size: 46739926
dataset_size: 187041147
- config_name: 20231101.ba
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 297738837
num_examples: 63319
download_size: 122595805
dataset_size: 297738837
- config_name: 20231101.ban
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 18012727
num_examples: 20986
download_size: 6715876
dataset_size: 18012727
- config_name: 20231101.bar
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 36317102
num_examples: 27096
download_size: 21799389
dataset_size: 36317102
- config_name: 20231101.bat-smg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 7212849
num_examples: 17221
download_size: 3348765
dataset_size: 7212849
- config_name: 20231101.bcl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 20394331
num_examples: 15743
download_size: 11369234
dataset_size: 20394331
- config_name: 20231101.be
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 624718980
num_examples: 236165
download_size: 284921288
dataset_size: 624718980
- config_name: 20231101.be-x-old
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 252510447
num_examples: 84361
download_size: 114318588
dataset_size: 252510447
- config_name: 20231101.bg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1103334425
num_examples: 294275
download_size: 512344058
dataset_size: 1103334425
- config_name: 20231101.bh
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 16675295
num_examples: 8612
download_size: 5880458
dataset_size: 16675295
- config_name: 20231101.bi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 404249
num_examples: 1548
download_size: 203610
dataset_size: 404249
- config_name: 20231101.bjn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6884860
num_examples: 10519
download_size: 3323032
dataset_size: 6884860
- config_name: 20231101.blk
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 26566991
num_examples: 2946
download_size: 8028430
dataset_size: 26566991
- config_name: 20231101.bm
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 623659
num_examples: 1258
download_size: 343812
dataset_size: 623659
- config_name: 20231101.bn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 962624238
num_examples: 143069
download_size: 343885999
dataset_size: 962624238
- config_name: 20231101.bo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 132723880
num_examples: 12881
download_size: 38851784
dataset_size: 132723880
- config_name: 20231101.bpy
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 42975314
num_examples: 25165
download_size: 6568483
dataset_size: 42975314
- config_name: 20231101.br
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 85635744
num_examples: 84340
download_size: 49768597
dataset_size: 85635744
- config_name: 20231101.bs
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 193734399
num_examples: 92596
download_size: 107858627
dataset_size: 193734399
- config_name: 20231101.bug
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3434889
num_examples: 15880
download_size: 817034
dataset_size: 3434889
- config_name: 20231101.bxr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6687172
num_examples: 2791
download_size: 3078699
dataset_size: 6687172
- config_name: 20231101.ca
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1958810542
num_examples: 737409
download_size: 1116799343
dataset_size: 1958810542
- config_name: 20231101.cbk-zam
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2061944
num_examples: 3285
download_size: 825899
dataset_size: 2061944
- config_name: 20231101.cdo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5109207
num_examples: 16449
download_size: 1982914
dataset_size: 5109207
- config_name: 20231101.ce
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 730387049
num_examples: 601271
download_size: 88393330
dataset_size: 730387049
- config_name: 20231101.ceb
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4568256711
num_examples: 6122708
download_size: 828085216
dataset_size: 4568256711
- config_name: 20231101.ch
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 178002
num_examples: 576
download_size: 89277
dataset_size: 178002
- config_name: 20231101.chr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 767618
num_examples: 1113
download_size: 343140
dataset_size: 767618
- config_name: 20231101.chy
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 148139
num_examples: 802
download_size: 75865
dataset_size: 148139
- config_name: 20231101.ckb
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 107150420
num_examples: 52024
download_size: 42964544
dataset_size: 107150420
- config_name: 20231101.co
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 11104243
num_examples: 7799
download_size: 5794731
dataset_size: 11104243
- config_name: 20231101.cr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 57257
num_examples: 187
download_size: 36081
dataset_size: 57257
- config_name: 20231101.crh
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 9689171
num_examples: 27691
download_size: 3654461
dataset_size: 9689171
- config_name: 20231101.cs
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1566286962
num_examples: 534044
download_size: 976484249
dataset_size: 1566286962
- config_name: 20231101.csb
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3748643
num_examples: 5480
download_size: 2055233
dataset_size: 3748643
- config_name: 20231101.cu
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 981592
num_examples: 1235
download_size: 398252
dataset_size: 981592
- config_name: 20231101.cv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 81873026
num_examples: 51863
download_size: 29640641
dataset_size: 81873026
- config_name: 20231101.cy
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 305837783
num_examples: 279455
download_size: 112257456
dataset_size: 305837783
- config_name: 20231101.da
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 547068330
num_examples: 295347
download_size: 327688122
dataset_size: 547068330
- config_name: 20231101.dag
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 21618973
num_examples: 10071
download_size: 9026986
dataset_size: 21618973
- config_name: 20231101.de
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 9622925305
num_examples: 2845308
download_size: 5771317942
dataset_size: 9622925305
- config_name: 20231101.din
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 564398
num_examples: 512
download_size: 340530
dataset_size: 564398
- config_name: 20231101.diq
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 19671441
num_examples: 41775
download_size: 7616839
dataset_size: 19671441
- config_name: 20231101.dsb
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3315228
num_examples: 3379
download_size: 1931937
dataset_size: 3315228
- config_name: 20231101.dty
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 7030648
num_examples: 3632
download_size: 2521250
dataset_size: 7030648
- config_name: 20231101.dv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 13934393
num_examples: 4352
download_size: 5283133
dataset_size: 13934393
- config_name: 20231101.dz
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 8855969
num_examples: 788
download_size: 2583520
dataset_size: 8855969
- config_name: 20231101.ee
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 898491
num_examples: 1181
download_size: 492813
dataset_size: 898491
- config_name: 20231101.el
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1345589075
num_examples: 226834
download_size: 637372489
dataset_size: 1345589075
- config_name: 20231101.eml
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3625415
num_examples: 12961
download_size: 1689575
dataset_size: 3625415
- config_name: 20231101.en
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 20200062385
num_examples: 6407814
download_size: 11630929031
dataset_size: 20200062385
- config_name: 20231101.eo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 523113804
num_examples: 344851
download_size: 297738138
dataset_size: 523113804
- config_name: 20231101.es
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6033536133
num_examples: 1841155
download_size: 3493595869
dataset_size: 6033536133
- config_name: 20231101.et
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 440177170
num_examples: 240397
download_size: 265444734
dataset_size: 440177170
- config_name: 20231101.eu
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 565567318
num_examples: 416347
download_size: 270355505
dataset_size: 565567318
- config_name: 20231101.ext
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4389633
num_examples: 3785
download_size: 2761099
dataset_size: 4389633
- config_name: 20231101.fa
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1899154938
num_examples: 979869
download_size: 759368283
dataset_size: 1899154938
- config_name: 20231101.fat
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2032812
num_examples: 1122
download_size: 1124684
dataset_size: 2032812
- config_name: 20231101.ff
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1867995
num_examples: 2419
download_size: 1087702
dataset_size: 1867995
- config_name: 20231101.fi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1146146663
num_examples: 561598
download_size: 680512230
dataset_size: 1146146663
- config_name: 20231101.fiu-vro
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4636361
num_examples: 6590
download_size: 2434159
dataset_size: 4636361
- config_name: 20231101.fj
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 604791
num_examples: 1294
download_size: 328059
dataset_size: 604791
- config_name: 20231101.fo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 15415249
num_examples: 14080
download_size: 8857239
dataset_size: 15415249
- config_name: 20231101.fon
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 592216
num_examples: 705
download_size: 317444
dataset_size: 592216
- config_name: 20231101.fr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 8065794826
num_examples: 2564646
download_size: 4614488286
dataset_size: 8065794826
- config_name: 20231101.frp
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3676441
num_examples: 5766
download_size: 1914046
dataset_size: 3676441
- config_name: 20231101.frr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 10819914
num_examples: 18666
download_size: 5317694
dataset_size: 10819914
- config_name: 20231101.fur
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4090412
num_examples: 4001
download_size: 2421238
dataset_size: 4090412
- config_name: 20231101.fy
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 134196708
num_examples: 52416
download_size: 76002257
dataset_size: 134196708
- config_name: 20231101.ga
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 60640820
num_examples: 59156
download_size: 34136733
dataset_size: 60640820
- config_name: 20231101.gag
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2428849
num_examples: 2968
download_size: 1331866
dataset_size: 2428849
- config_name: 20231101.gan
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2915229
num_examples: 6743
download_size: 1508844
dataset_size: 2915229
- config_name: 20231101.gcr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2338277
num_examples: 2399
download_size: 1345482
dataset_size: 2338277
- config_name: 20231101.gd
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 14051607
num_examples: 15979
download_size: 7190137
dataset_size: 14051607
- config_name: 20231101.gl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 493905881
num_examples: 200092
download_size: 291104907
dataset_size: 493905881
- config_name: 20231101.glk
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6086185
num_examples: 7049
download_size: 2382997
dataset_size: 6086185
- config_name: 20231101.gn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6921948
num_examples: 5519
download_size: 3806548
dataset_size: 6921948
- config_name: 20231101.gom
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 30889533
num_examples: 4259
download_size: 11306217
dataset_size: 30889533
- config_name: 20231101.gor
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6369540
num_examples: 15359
download_size: 2101154
dataset_size: 6369540
- config_name: 20231101.got
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1533770
num_examples: 1013
download_size: 636307
dataset_size: 1533770
- config_name: 20231101.gpe
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2017667
num_examples: 1110
download_size: 1141261
dataset_size: 2017667
- config_name: 20231101.gu
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 121282557
num_examples: 30445
download_size: 39554078
dataset_size: 121282557
- config_name: 20231101.guc
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 978923
num_examples: 679
download_size: 578311
dataset_size: 978923
- config_name: 20231101.gur
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2325435
num_examples: 1383
download_size: 1068954
dataset_size: 2325435
- config_name: 20231101.guw
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1913143
num_examples: 1312
download_size: 1042328
dataset_size: 1913143
- config_name: 20231101.gv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6307253
num_examples: 6206
download_size: 3347095
dataset_size: 6307253
- config_name: 20231101.ha
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 77906472
num_examples: 36492
download_size: 43131815
dataset_size: 77906472
- config_name: 20231101.hak
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4523680
num_examples: 10246
download_size: 1878558
dataset_size: 4523680
- config_name: 20231101.haw
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1677790
num_examples: 2612
download_size: 696781
dataset_size: 1677790
- config_name: 20231101.he
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1950200381
num_examples: 333874
download_size: 979183998
dataset_size: 1950200381
- config_name: 20231101.hi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 672817362
num_examples: 163093
download_size: 237834604
dataset_size: 672817362
- config_name: 20231101.hif
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5685329
num_examples: 10986
download_size: 2715682
dataset_size: 5685329
- config_name: 20231101.hr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 443636903
num_examples: 202848
download_size: 275245343
dataset_size: 443636903
- config_name: 20231101.hsb
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 15667118
num_examples: 13957
download_size: 7437491
dataset_size: 15667118
- config_name: 20231101.ht
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 55088040
num_examples: 70159
download_size: 21993952
dataset_size: 55088040
- config_name: 20231101.hu
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1515899113
num_examples: 532427
download_size: 904857314
dataset_size: 1515899113
- config_name: 20231101.hy
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1179459973
num_examples: 303036
download_size: 490121120
dataset_size: 1179459973
- config_name: 20231101.hyw
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 59564550
num_examples: 11725
download_size: 27450541
dataset_size: 59564550
- config_name: 20231101.ia
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 16409449
num_examples: 28247
download_size: 8237640
dataset_size: 16409449
- config_name: 20231101.id
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1125928594
num_examples: 665622
download_size: 583801799
dataset_size: 1125928594
- config_name: 20231101.ie
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6737711
num_examples: 11877
download_size: 3019044
dataset_size: 6737711
- config_name: 20231101.ig
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 66086115
num_examples: 22908
download_size: 34663540
dataset_size: 66086115
- config_name: 20231101.ik
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 199773
num_examples: 846
download_size: 115758
dataset_size: 199773
- config_name: 20231101.ilo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 16854494
num_examples: 15371
download_size: 7352572
dataset_size: 16854494
- config_name: 20231101.inh
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2727253
num_examples: 2123
download_size: 1279524
dataset_size: 2727253
- config_name: 20231101.io
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 38735196
num_examples: 40930
download_size: 17106040
dataset_size: 38735196
- config_name: 20231101.is
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 87856729
num_examples: 57453
download_size: 52286137
dataset_size: 87856729
- config_name: 20231101.it
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4924856310
num_examples: 1833639
download_size: 2931265519
dataset_size: 4924856310
- config_name: 20231101.iu
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 291185
num_examples: 562
download_size: 136987
dataset_size: 291185
- config_name: 20231101.ja
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 7039610767
num_examples: 1389467
download_size: 3941998526
dataset_size: 7039610767
- config_name: 20231101.jam
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1142348
num_examples: 1780
download_size: 702664
dataset_size: 1142348
- config_name: 20231101.jbo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2523538
num_examples: 1394
download_size: 890356
dataset_size: 2523538
- config_name: 20231101.jv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 72786688
num_examples: 73380
download_size: 36852134
dataset_size: 72786688
- config_name: 20231101.ka
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 699872960
num_examples: 169602
download_size: 239987665
dataset_size: 699872960
- config_name: 20231101.kaa
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5139436
num_examples: 4074
download_size: 2913134
dataset_size: 5139436
- config_name: 20231101.kab
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4392542
num_examples: 5830
download_size: 2580584
dataset_size: 4392542
- config_name: 20231101.kbd
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3014575
num_examples: 1670
download_size: 1304580
dataset_size: 3014575
- config_name: 20231101.kbp
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3584563
num_examples: 1931
download_size: 1806400
dataset_size: 3584563
- config_name: 20231101.kcg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 914665
num_examples: 1151
download_size: 513904
dataset_size: 914665
- config_name: 20231101.kg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 390163
num_examples: 1329
download_size: 209059
dataset_size: 390163
- config_name: 20231101.ki
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 760980
num_examples: 1668
download_size: 427003
dataset_size: 760980
- config_name: 20231101.kk
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 497917145
num_examples: 238615
download_size: 180750520
dataset_size: 497917145
- config_name: 20231101.kl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 313658
num_examples: 301
download_size: 193719
dataset_size: 313658
- config_name: 20231101.km
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 103252582
num_examples: 11994
download_size: 35567417
dataset_size: 103252582
- config_name: 20231101.kn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 402848197
num_examples: 31437
download_size: 147156434
dataset_size: 402848197
- config_name: 20231101.ko
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1412099944
num_examples: 647897
download_size: 782677061
dataset_size: 1412099944
- config_name: 20231101.koi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5103799
num_examples: 3504
download_size: 1888392
dataset_size: 5103799
- config_name: 20231101.krc
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4589808
num_examples: 2100
download_size: 2022144
dataset_size: 4589808
- config_name: 20231101.ks
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2868186
num_examples: 4307
download_size: 1094458
dataset_size: 2868186
- config_name: 20231101.ksh
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3117003
num_examples: 2945
download_size: 2009928
dataset_size: 3117003
- config_name: 20231101.ku
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 44523131
num_examples: 63076
download_size: 22938233
dataset_size: 44523131
- config_name: 20231101.kv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 9245577
num_examples: 5595
download_size: 3690978
dataset_size: 9245577
- config_name: 20231101.kw
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4687165
num_examples: 6995
download_size: 2711398
dataset_size: 4687165
- config_name: 20231101.ky
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 166911089
num_examples: 79438
download_size: 63947035
dataset_size: 166911089
- config_name: 20231101.la
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 141080163
num_examples: 138263
download_size: 76588430
dataset_size: 141080163
- config_name: 20231101.lad
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4901343
num_examples: 3663
download_size: 2754531
dataset_size: 4901343
- config_name: 20231101.lb
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 88826996
num_examples: 62414
download_size: 50515020
dataset_size: 88826996
- config_name: 20231101.lbe
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 745140
num_examples: 1279
download_size: 304394
dataset_size: 745140
- config_name: 20231101.lez
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 9794637
num_examples: 4264
download_size: 3864848
dataset_size: 9794637
- config_name: 20231101.lfn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 8870685
num_examples: 4832
download_size: 5207546
dataset_size: 8870685
- config_name: 20231101.lg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6891539
num_examples: 4048
download_size: 3708097
dataset_size: 6891539
- config_name: 20231101.li
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 29633678
num_examples: 14849
download_size: 17727918
dataset_size: 29633678
- config_name: 20231101.lij
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 11448686
num_examples: 11203
download_size: 6255409
dataset_size: 11448686
- config_name: 20231101.lld
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 50163974
num_examples: 180677
download_size: 13866243
dataset_size: 50163974
- config_name: 20231101.lmo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 43496783
num_examples: 73510
download_size: 19142356
dataset_size: 43496783
- config_name: 20231101.ln
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2035050
num_examples: 3534
download_size: 1122138
dataset_size: 2035050
- config_name: 20231101.lo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 15283258
num_examples: 5014
download_size: 5646554
dataset_size: 15283258
- config_name: 20231101.lt
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 336559824
num_examples: 211292
download_size: 194873569
dataset_size: 336559824
- config_name: 20231101.ltg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 915364
num_examples: 1070
download_size: 530299
dataset_size: 915364
- config_name: 20231101.lv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 227272112
num_examples: 123413
download_size: 129739227
dataset_size: 227272112
- config_name: 20231101.mad
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1596836
num_examples: 1192
download_size: 908630
dataset_size: 1596836
- config_name: 20231101.mai
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 21562856
num_examples: 14714
download_size: 6180231
dataset_size: 21562856
- config_name: 20231101.map-bms
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5341068
num_examples: 13580
download_size: 2377123
dataset_size: 5341068
- config_name: 20231101.mdf
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4694770
num_examples: 4257
download_size: 1725294
dataset_size: 4694770
- config_name: 20231101.mg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 73767229
num_examples: 96316
download_size: 22117304
dataset_size: 73767229
- config_name: 20231101.mhr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 19249450
num_examples: 11347
download_size: 6902162
dataset_size: 19249450
- config_name: 20231101.mi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4169094
num_examples: 7919
download_size: 1044444
dataset_size: 4169094
- config_name: 20231101.min
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 118995918
num_examples: 227143
download_size: 25691303
dataset_size: 118995918
- config_name: 20231101.mk
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 651422351
num_examples: 139559
download_size: 271265486
dataset_size: 651422351
- config_name: 20231101.ml
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 494135127
num_examples: 85791
download_size: 183071274
dataset_size: 494135127
- config_name: 20231101.mn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 91943210
num_examples: 24048
download_size: 41521786
dataset_size: 91943210
- config_name: 20231101.mni
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 9820483
num_examples: 10894
download_size: 2208525
dataset_size: 9820483
- config_name: 20231101.mnw
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 47237206
num_examples: 3295
download_size: 13765461
dataset_size: 47237206
- config_name: 20231101.mr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 261879018
num_examples: 94133
download_size: 81991233
dataset_size: 261879018
- config_name: 20231101.mrj
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 8732281
num_examples: 10542
download_size: 3283618
dataset_size: 8732281
- config_name: 20231101.ms
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 423352360
num_examples: 368628
download_size: 210149264
dataset_size: 423352360
- config_name: 20231101.mt
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 32009639
num_examples: 5743
download_size: 18686521
dataset_size: 32009639
- config_name: 20231101.mwl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 19353725
num_examples: 4500
download_size: 11521563
dataset_size: 19353725
- config_name: 20231101.my
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 314417700
num_examples: 109310
download_size: 85497205
dataset_size: 314417700
- config_name: 20231101.myv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 11145865
num_examples: 7958
download_size: 4600620
dataset_size: 11145865
- config_name: 20231101.mzn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 16335757
num_examples: 18717
download_size: 5419390
dataset_size: 16335757
- config_name: 20231101.nah
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2503320
num_examples: 6218
download_size: 1191779
dataset_size: 2503320
- config_name: 20231101.nap
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 6395706
num_examples: 14884
download_size: 3188122
dataset_size: 6395706
- config_name: 20231101.nds
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 92990126
num_examples: 84285
download_size: 48106879
dataset_size: 92990126
- config_name: 20231101.nds-nl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 13582403
num_examples: 7847
download_size: 8354427
dataset_size: 13582403
- config_name: 20231101.ne
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 109032486
num_examples: 32885
download_size: 37548833
dataset_size: 109032486
- config_name: 20231101.new
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 159095610
num_examples: 73003
download_size: 20517810
dataset_size: 159095610
- config_name: 20231101.nia
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2117902
num_examples: 1714
download_size: 1086670
dataset_size: 2117902
- config_name: 20231101.nl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2646316266
num_examples: 2135977
download_size: 1436843432
dataset_size: 2646316266
- config_name: 20231101.nn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 237467406
num_examples: 167653
download_size: 134751873
dataset_size: 237467406
- config_name: 20231101.no
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1033188011
num_examples: 617937
download_size: 590970350
dataset_size: 1033188011
- config_name: 20231101.nov
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 965640
num_examples: 1693
download_size: 493500
dataset_size: 965640
- config_name: 20231101.nqo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 8261058
num_examples: 1580
download_size: 3508645
dataset_size: 8261058
- config_name: 20231101.nrm
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3216817
num_examples: 4902
download_size: 1507257
dataset_size: 3216817
- config_name: 20231101.nso
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2796467
num_examples: 8650
download_size: 936349
dataset_size: 2796467
- config_name: 20231101.nv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 16993060
num_examples: 22460
download_size: 3304031
dataset_size: 16993060
- config_name: 20231101.ny
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1691825
num_examples: 1129
download_size: 938621
dataset_size: 1691825
- config_name: 20231101.oc
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 120092607
num_examples: 89101
download_size: 64043588
dataset_size: 120092607
- config_name: 20231101.olo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3173332
num_examples: 4640
download_size: 1724315
dataset_size: 3173332
- config_name: 20231101.om
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3604768
num_examples: 1970
download_size: 1982849
dataset_size: 3604768
- config_name: 20231101.or
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 75078226
num_examples: 17375
download_size: 26706212
dataset_size: 75078226
- config_name: 20231101.os
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 13182881
num_examples: 17663
download_size: 5572799
dataset_size: 13182881
- config_name: 20231101.pa
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 212972877
num_examples: 51423
download_size: 81452929
dataset_size: 212972877
- config_name: 20231101.pag
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1391816
num_examples: 2665
download_size: 455808
dataset_size: 1391816
- config_name: 20231101.pam
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 8294902
num_examples: 9006
download_size: 4277038
dataset_size: 8294902
- config_name: 20231101.pap
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4251480
num_examples: 3520
download_size: 2435005
dataset_size: 4251480
- config_name: 20231101.pcd
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5704321
num_examples: 5717
download_size: 3145572
dataset_size: 5704321
- config_name: 20231101.pcm
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1886987
num_examples: 1238
download_size: 1160762
dataset_size: 1886987
- config_name: 20231101.pdc
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1225978
num_examples: 2176
download_size: 698254
dataset_size: 1225978
- config_name: 20231101.pfl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3694464
num_examples: 2762
download_size: 1971214
dataset_size: 3694464
- config_name: 20231101.pi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1144100
num_examples: 3057
download_size: 200764
dataset_size: 1144100
- config_name: 20231101.pih
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 278139
num_examples: 934
download_size: 177092
dataset_size: 278139
- config_name: 20231101.pl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2950148809
num_examples: 1587721
download_size: 1765059986
dataset_size: 2950148809
- config_name: 20231101.pms
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 34340217
num_examples: 67980
download_size: 12008880
dataset_size: 34340217
- config_name: 20231101.pnb
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 304117649
num_examples: 72307
download_size: 133266242
dataset_size: 304117649
- config_name: 20231101.pnt
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 630636
num_examples: 533
download_size: 275639
dataset_size: 630636
- config_name: 20231101.ps
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 114259737
num_examples: 20529
download_size: 53312545
dataset_size: 114259737
- config_name: 20231101.pt
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2758783436
num_examples: 1112246
download_size: 1579641059
dataset_size: 2758783436
- config_name: 20231101.pwn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 811954
num_examples: 408
download_size: 444109
dataset_size: 811954
- config_name: 20231101.qu
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 16828457
num_examples: 24196
download_size: 7688106
dataset_size: 16828457
- config_name: 20231101.rm
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 18053014
num_examples: 3822
download_size: 10483970
dataset_size: 18053014
- config_name: 20231101.rmy
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 611778
num_examples: 1279
download_size: 356457
dataset_size: 611778
- config_name: 20231101.rn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 530318
num_examples: 819
download_size: 301252
dataset_size: 530318
- config_name: 20231101.ro
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 847410736
num_examples: 442389
download_size: 466937380
dataset_size: 847410736
- config_name: 20231101.roa-rup
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1687829
num_examples: 1432
download_size: 951677
dataset_size: 1687829
- config_name: 20231101.roa-tara
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 7470331
num_examples: 9367
download_size: 4003095
dataset_size: 7470331
- config_name: 20231101.ru
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 10277958919
num_examples: 1945063
download_size: 4876849588
dataset_size: 10277958919
- config_name: 20231101.rue
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 13128572
num_examples: 8759
download_size: 6346106
dataset_size: 13128572
- config_name: 20231101.rw
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 11898854
num_examples: 8063
download_size: 6623388
dataset_size: 11898854
- config_name: 20231101.sa
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 69854997
num_examples: 12156
download_size: 23850161
dataset_size: 69854997
- config_name: 20231101.sah
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 48562374
num_examples: 17098
download_size: 21675888
dataset_size: 48562374
- config_name: 20231101.sat
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 45247783
num_examples: 9767
download_size: 15428584
dataset_size: 45247783
- config_name: 20231101.sc
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 12776438
num_examples: 7586
download_size: 7711996
dataset_size: 12776438
- config_name: 20231101.scn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 17685098
num_examples: 26530
download_size: 10223816
dataset_size: 17685098
- config_name: 20231101.sco
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 42808738
num_examples: 35276
download_size: 24287944
dataset_size: 42808738
- config_name: 20231101.sd
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 37021659
num_examples: 16928
download_size: 17591997
dataset_size: 37021659
- config_name: 20231101.se
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3600527
num_examples: 8043
download_size: 1816006
dataset_size: 3600527
- config_name: 20231101.sg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 140127
num_examples: 564
download_size: 72486
dataset_size: 140127
- config_name: 20231101.sh
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 569225870
num_examples: 458392
download_size: 266379293
dataset_size: 569225870
- config_name: 20231101.shi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2369002
num_examples: 1779
download_size: 1359828
dataset_size: 2369002
- config_name: 20231101.shn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 33553593
num_examples: 13945
download_size: 8163231
dataset_size: 33553593
- config_name: 20231101.si
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 138806443
num_examples: 23065
download_size: 54229127
dataset_size: 138806443
- config_name: 20231101.simple
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 291254232
num_examples: 241787
download_size: 156885218
dataset_size: 291254232
- config_name: 20231101.sk
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 416804817
num_examples: 242235
download_size: 239513292
dataset_size: 416804817
- config_name: 20231101.skr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 22705446
num_examples: 5819
download_size: 9978607
dataset_size: 22705446
- config_name: 20231101.sl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 454829910
num_examples: 183006
download_size: 267485569
dataset_size: 454829910
- config_name: 20231101.sm
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 902927
num_examples: 1151
download_size: 492349
dataset_size: 902927
- config_name: 20231101.smn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5764244
num_examples: 5383
download_size: 2813872
dataset_size: 5764244
- config_name: 20231101.sn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 9790528
num_examples: 11621
download_size: 4979456
dataset_size: 9790528
- config_name: 20231101.so
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 13663784
num_examples: 9021
download_size: 7940363
dataset_size: 13663784
- config_name: 20231101.sq
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 208779652
num_examples: 104854
download_size: 116945494
dataset_size: 208779652
- config_name: 20231101.sr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1721596392
num_examples: 676605
download_size: 697391786
dataset_size: 1721596392
- config_name: 20231101.srn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 649317
num_examples: 1219
download_size: 215103
dataset_size: 649317
- config_name: 20231101.ss
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1076102
num_examples: 945
download_size: 600997
dataset_size: 1076102
- config_name: 20231101.st
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 968161
num_examples: 1099
download_size: 530165
dataset_size: 968161
- config_name: 20231101.stq
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4942784
num_examples: 4134
download_size: 2884429
dataset_size: 4942784
- config_name: 20231101.su
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 48066965
num_examples: 61555
download_size: 19806020
dataset_size: 48066965
- config_name: 20231101.sv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2153690744
num_examples: 2574513
download_size: 974261228
dataset_size: 2153690744
- config_name: 20231101.sw
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 73119299
num_examples: 78587
download_size: 35936177
dataset_size: 73119299
- config_name: 20231101.szl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 21439309
num_examples: 57035
download_size: 7347967
dataset_size: 21439309
- config_name: 20231101.szy
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 11355780
num_examples: 4885
download_size: 6192815
dataset_size: 11355780
- config_name: 20231101.ta
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 810734099
num_examples: 160651
download_size: 265652020
dataset_size: 810734099
- config_name: 20231101.tay
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2974229
num_examples: 2747
download_size: 1232811
dataset_size: 2974229
- config_name: 20231101.tcy
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 12166612
num_examples: 2202
download_size: 4611006
dataset_size: 12166612
- config_name: 20231101.te
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 730376585
num_examples: 87854
download_size: 215097076
dataset_size: 730376585
- config_name: 20231101.tet
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1466200
num_examples: 1468
download_size: 744390
dataset_size: 1466200
- config_name: 20231101.tg
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 148256281
num_examples: 110962
download_size: 49825647
dataset_size: 148256281
- config_name: 20231101.th
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1014547923
num_examples: 159719
download_size: 371916105
dataset_size: 1014547923
- config_name: 20231101.ti
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 729995
num_examples: 435
download_size: 363723
dataset_size: 729995
- config_name: 20231101.tk
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 13326412
num_examples: 7918
download_size: 7383654
dataset_size: 13326412
- config_name: 20231101.tl
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 85794472
num_examples: 45341
download_size: 45797527
dataset_size: 85794472
- config_name: 20231101.tly
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2590482
num_examples: 8086
download_size: 1070456
dataset_size: 2590482
- config_name: 20231101.tn
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4380768
num_examples: 1585
download_size: 1708110
dataset_size: 4380768
- config_name: 20231101.to
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1090611
num_examples: 1887
download_size: 518244
dataset_size: 1090611
- config_name: 20231101.tpi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 460420
num_examples: 1399
download_size: 241908
dataset_size: 460420
- config_name: 20231101.tr
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 997254242
num_examples: 534988
download_size: 552923659
dataset_size: 997254242
- config_name: 20231101.trv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4971204
num_examples: 1880
download_size: 2706664
dataset_size: 4971204
- config_name: 20231101.ts
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 847032
num_examples: 785
download_size: 455648
dataset_size: 847032
- config_name: 20231101.tt
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 681325421
num_examples: 501116
download_size: 129141056
dataset_size: 681325421
- config_name: 20231101.tum
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 13429984
num_examples: 18708
download_size: 5459856
dataset_size: 13429984
- config_name: 20231101.tw
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 7982767
num_examples: 3978
download_size: 4118530
dataset_size: 7982767
- config_name: 20231101.ty
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 338743
num_examples: 1355
download_size: 150963
dataset_size: 338743
- config_name: 20231101.tyv
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 14324694
num_examples: 3491
download_size: 6528290
dataset_size: 14324694
- config_name: 20231101.udm
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 7036113
num_examples: 5677
download_size: 2982821
dataset_size: 7036113
- config_name: 20231101.ug
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 42254159
num_examples: 8634
download_size: 17741860
dataset_size: 42254159
- config_name: 20231101.uk
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 4969483901
num_examples: 1294720
download_size: 2276769383
dataset_size: 4969483901
- config_name: 20231101.ur
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 410511855
num_examples: 200154
download_size: 167627869
dataset_size: 410511855
- config_name: 20231101.uz
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 397176774
num_examples: 246729
download_size: 210262652
dataset_size: 397176774
- config_name: 20231101.ve
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 359542
num_examples: 840
download_size: 163318
dataset_size: 359542
- config_name: 20231101.vec
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 37917528
num_examples: 69268
download_size: 16179506
dataset_size: 37917528
- config_name: 20231101.vep
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 11643856
num_examples: 6960
download_size: 6423002
dataset_size: 11643856
- config_name: 20231101.vi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1617830227
num_examples: 1288680
download_size: 729557588
dataset_size: 1617830227
- config_name: 20231101.vls
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 11336278
num_examples: 7872
download_size: 6985406
dataset_size: 11336278
- config_name: 20231101.vo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 19521708
num_examples: 35193
download_size: 6582571
dataset_size: 19521708
- config_name: 20231101.wa
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 12268826
num_examples: 12038
download_size: 7327616
dataset_size: 12268826
- config_name: 20231101.war
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 467647882
num_examples: 1266394
download_size: 104588442
dataset_size: 467647882
- config_name: 20231101.wo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3525303
num_examples: 1746
download_size: 2094574
dataset_size: 3525303
- config_name: 20231101.wuu
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 25029545
num_examples: 43010
download_size: 15985963
dataset_size: 25029545
- config_name: 20231101.xal
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1391731
num_examples: 2295
download_size: 507198
dataset_size: 1391731
- config_name: 20231101.xh
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 3665998
num_examples: 1883
download_size: 2505472
dataset_size: 3665998
- config_name: 20231101.xmf
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 37712629
num_examples: 18099
download_size: 12948576
dataset_size: 37712629
- config_name: 20231101.yi
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 36038273
num_examples: 15179
download_size: 16218296
dataset_size: 36038273
- config_name: 20231101.yo
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 19081408
num_examples: 33819
download_size: 8861465
dataset_size: 19081408
- config_name: 20231101.za
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 1365300
num_examples: 2993
download_size: 666521
dataset_size: 1365300
- config_name: 20231101.zea
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 5224563
num_examples: 6082
download_size: 2620396
dataset_size: 5224563
- config_name: 20231101.zh
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 2790577882
num_examples: 1384748
download_size: 1721150260
dataset_size: 2790577882
- config_name: 20231101.zh-classical
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 14869227
num_examples: 12708
download_size: 10098073
dataset_size: 14869227
- config_name: 20231101.zh-min-nan
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 153672031
num_examples: 432798
download_size: 37122048
dataset_size: 153672031
- config_name: 20231101.zh-yue
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 109936351
num_examples: 134140
download_size: 64950815
dataset_size: 109936351
- config_name: 20231101.zu
features:
- name: id
dtype: string
- name: url
dtype: string
- name: title
dtype: string
- name: text
dtype: string
splits:
- name: train
num_bytes: 7088246
num_examples: 11561
download_size: 3792429
dataset_size: 7088246
language_bcp47:
- be-tarask
- en-simple
---
# Dataset Card for Wikimedia Wikipedia
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Annotations](#annotations)
- [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
- [Social Impact of Dataset](#social-impact-of-dataset)
- [Discussion of Biases](#discussion-of-biases)
- [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
- [Dataset Curators](#dataset-curators)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://dumps.wikimedia.org](https://dumps.wikimedia.org)
- **Repository:**
- **Paper:**
- **Point of Contact:**
### Dataset Summary
Wikipedia dataset containing cleaned articles in all languages.
The dataset is built from the Wikipedia dumps (https://dumps.wikimedia.org/),
with one subset per language, each containing a single train split.
Each example contains the content of one full Wikipedia article, cleaned to strip
wiki markup and unwanted sections (references, etc.).
All language subsets have already been processed for a recent dump, and you can load them by date and language as follows:
```python
from datasets import load_dataset
ds = load_dataset("wikimedia/wikipedia", "20231101.en")
```
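Subset names follow the pattern `<dump date>.<language code>`, as in `20231101.en` above. The following is a minimal sketch of a helper that builds and sanity-checks such a name before it is passed to `load_dataset`. The function name `wikipedia_config_name` and the validation rules are illustrative assumptions inferred from the subset names listed in this card; they are not part of the `datasets` API.

```python
import re

# Assumed shape of a language code, inferred from subset names in this card
# such as "en", "zh-min-nan" or "zh-classical". Not an official specification.
_LANG_RE = re.compile(r"[a-z]+(-[a-z]+)*")


def wikipedia_config_name(date: str, lang: str) -> str:
    """Build a subset name like '20231101.en' (hypothetical helper)."""
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError(f"expected a YYYYMMDD dump date, got {date!r}")
    if not _LANG_RE.fullmatch(lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    return f"{date}.{lang}"


# Example (requires network access, so it is left commented out):
# from datasets import load_dataset
# ds = load_dataset("wikimedia/wikipedia", wikipedia_config_name("20231101", "sw"))
```

The helper only checks the form of the name; whether a given date and language combination actually exists still depends on the subsets published for this dataset.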
#### Data Visualization
Click the [Nomic Atlas](https://atlas.nomic.ai/map/475c26d7-b142-4795-9887-02b6eeb18dc0/0d312be6-a3bb-4586-b6b7-53dcd0cbefa5) map below to visualize the 6.4 million samples in the `20231101.en` subset.
<a href="https://atlas.nomic.ai/map/475c26d7-b142-4795-9887-02b6eeb18dc0/0d312be6-a3bb-4586-b6b7-53dcd0cbefa5">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6480c476cacb1c4a0696eeb8/sZNN6Vubc0Oue83vKaJUu.webp" alt="Nomic-Atlas Wikipedia Map" width="25%"/>
</a>
### Supported Tasks and Leaderboards
The dataset is generally used for language modeling.
### Languages
You can find the list of languages here: https://meta.wikimedia.org/wiki/List_of_Wikipedias
## Dataset Structure
### Data Instances
An example looks as follows:
```
{'id': '1',
'url': 'https://simple.wikipedia.org/wiki/April',
'title': 'April',
'text': 'April is the fourth month...'
}
```
### Data Fields
The data fields are the same among all configurations:
- `id` (`str`): ID of the article.
- `url` (`str`): URL of the article.
- `title` (`str`): Title of the article.
- `text` (`str`): Text content of the article.
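
Since every configuration shares these four string fields, a record can be described and checked with a small schema. The sketch below is illustrative only: the names `WikipediaArticle`, `EXPECTED_FIELDS`, and `is_valid_article` are assumptions introduced here and do not come from the dataset or the `datasets` library.

```python
from typing import TypedDict


class WikipediaArticle(TypedDict):
    """Hypothetical type mirroring the four fields listed above."""
    id: str
    url: str
    title: str
    text: str


EXPECTED_FIELDS = ("id", "url", "title", "text")


def is_valid_article(record: dict) -> bool:
    """Return True if the record has exactly the expected string fields."""
    if set(record) != set(EXPECTED_FIELDS):
        return False
    return all(isinstance(record[name], str) for name in EXPECTED_FIELDS)


# The instance shown in "Data Instances" passes the check.
sample: WikipediaArticle = {
    "id": "1",
    "url": "https://simple.wikipedia.org/wiki/April",
    "title": "April",
    "text": "April is the fourth month...",
}
print(is_valid_article(sample))
```

Note that `id` is stored as a string, so a record with an integer `id` would fail this check.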
### Data Splits
All configurations contain a single `train` split.
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
The dataset is built from the Wikipedia dumps: https://dumps.wikimedia.org
You can find the full list of languages and dates here: https://dumps.wikimedia.org/backup-index.html
The articles have been parsed using the [`mwparserfromhell`](https://mwparserfromhell.readthedocs.io) tool.
When uploading the data files for the 20231101 dump, we noticed that the Wikimedia Dumps website does not contain a dump for this date
for the "bbc", "dga", or "zgh" Wikipedias. We have reported the issue on the Wikimedia Phabricator: https://phabricator.wikimedia.org/T351761
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
Copyright licensing information: https://dumps.wikimedia.org/legal.html
All original textual content is licensed under the [GNU Free Documentation License](https://www.gnu.org/licenses/fdl-1.3.html) (GFDL)
and the [Creative Commons Attribution-Share-Alike 3.0 License](https://creativecommons.org/licenses/by-sa/3.0/).
Some text may be available only under the Creative Commons license; see the Wikimedia [Terms of Use](https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use) for details.
Text written by some authors may be released under additional licenses or into the public domain.
### Citation Information
```
@ONLINE{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}
```
Provider:
sssOrganization



