EliMC/finewiki
Source: Hugging Face · Updated 2025-12-05 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/EliMC/finewiki
---
license:
- cc-by-sa-4.0
- gfdl
task_categories:
- text-generation
pretty_name: 🌐 FineWiki
configs:
- config_name: ab
data_files:
- split: train
path: data/abwiki/*
- config_name: ace
data_files:
- split: train
path: data/acewiki/*
- config_name: ady
data_files:
- split: train
path: data/adywiki/*
- config_name: af
data_files:
- split: train
path: data/afwiki/*
- config_name: als
data_files:
- split: train
path: data/alswiki/*
- config_name: alt
data_files:
- split: train
path: data/altwiki/*
- config_name: ami
data_files:
- split: train
path: data/amiwiki/*
- config_name: am
data_files:
- split: train
path: data/amwiki/*
- config_name: ang
data_files:
- split: train
path: data/angwiki/*
- config_name: anp
data_files:
- split: train
path: data/anpwiki/*
- config_name: an
data_files:
- split: train
path: data/anwiki/*
- config_name: arc
data_files:
- split: train
path: data/arcwiki/*
- config_name: ar
data_files:
- split: train
path: data/arwiki/*
- config_name: ary
data_files:
- split: train
path: data/arywiki/*
- config_name: arz
data_files:
- split: train
path: data/arzwiki/*
- config_name: ast
data_files:
- split: train
path: data/astwiki/*
- config_name: as
data_files:
- split: train
path: data/aswiki/*
- config_name: atj
data_files:
- split: train
path: data/atjwiki/*
- config_name: avk
data_files:
- split: train
path: data/avkwiki/*
- config_name: av
data_files:
- split: train
path: data/avwiki/*
- config_name: awa
data_files:
- split: train
path: data/awawiki/*
- config_name: ay
data_files:
- split: train
path: data/aywiki/*
- config_name: azb
data_files:
- split: train
path: data/azbwiki/*
- config_name: az
data_files:
- split: train
path: data/azwiki/*
- config_name: ban
data_files:
- split: train
path: data/banwiki/*
- config_name: bar
data_files:
- split: train
path: data/barwiki/*
- config_name: bat_smg
data_files:
- split: train
path: data/bat_smgwiki/*
- config_name: ba
data_files:
- split: train
path: data/bawiki/*
- config_name: bbc
data_files:
- split: train
path: data/bbcwiki/*
- config_name: bcl
data_files:
- split: train
path: data/bclwiki/*
- config_name: be
data_files:
- split: train
path: data/bewiki/*
- config_name: bg
data_files:
- split: train
path: data/bgwiki/*
- config_name: bh
data_files:
- split: train
path: data/bhwiki/*
- config_name: bi
data_files:
- split: train
path: data/biwiki/*
- config_name: bjn
data_files:
- split: train
path: data/bjnwiki/*
- config_name: blk
data_files:
- split: train
path: data/blkwiki/*
- config_name: bm
data_files:
- split: train
path: data/bmwiki/*
- config_name: bn
data_files:
- split: train
path: data/bnwiki/*
- config_name: bo
data_files:
- split: train
path: data/bowiki/*
- config_name: bpy
data_files:
- split: train
path: data/bpywiki/*
- config_name: br
data_files:
- split: train
path: data/brwiki/*
- config_name: bs
data_files:
- split: train
path: data/bswiki/*
- config_name: bug
data_files:
- split: train
path: data/bugwiki/*
- config_name: bxr
data_files:
- split: train
path: data/bxrwiki/*
- config_name: ca
data_files:
- split: train
path: data/cawiki/*
- config_name: cbk_zam
data_files:
- split: train
path: data/cbk_zamwiki/*
- config_name: cdo
data_files:
- split: train
path: data/cdowiki/*
- config_name: ceb
data_files:
- split: train
path: data/cebwiki/*
- config_name: ce
data_files:
- split: train
path: data/cewiki/*
- config_name: chr
data_files:
- split: train
path: data/chrwiki/*
- config_name: ch
data_files:
- split: train
path: data/chwiki/*
- config_name: chy
data_files:
- split: train
path: data/chywiki/*
- config_name: ckb
data_files:
- split: train
path: data/ckbwiki/*
- config_name: co
data_files:
- split: train
path: data/cowiki/*
- config_name: crh
data_files:
- split: train
path: data/crhwiki/*
- config_name: cr
data_files:
- split: train
path: data/crwiki/*
- config_name: csb
data_files:
- split: train
path: data/csbwiki/*
- config_name: cs
data_files:
- split: train
path: data/cswiki/*
- config_name: cu
data_files:
- split: train
path: data/cuwiki/*
- config_name: cv
data_files:
- split: train
path: data/cvwiki/*
- config_name: cy
data_files:
- split: train
path: data/cywiki/*
- config_name: dag
data_files:
- split: train
path: data/dagwiki/*
- config_name: da
data_files:
- split: train
path: data/dawiki/*
- config_name: de
data_files:
- split: train
path: data/dewiki/*
- config_name: dga
data_files:
- split: train
path: data/dgawiki/*
- config_name: din
data_files:
- split: train
path: data/dinwiki/*
- config_name: diq
data_files:
- split: train
path: data/diqwiki/*
- config_name: dsb
data_files:
- split: train
path: data/dsbwiki/*
- config_name: dty
data_files:
- split: train
path: data/dtywiki/*
- config_name: dv
data_files:
- split: train
path: data/dvwiki/*
- config_name: dz
data_files:
- split: train
path: data/dzwiki/*
- config_name: ee
data_files:
- split: train
path: data/eewiki/*
- config_name: el
data_files:
- split: train
path: data/elwiki/*
- config_name: eml
data_files:
- split: train
path: data/emlwiki/*
- config_name: en
default: true
data_files:
- split: train
path: data/enwiki/*
- config_name: eo
data_files:
- split: train
path: data/eowiki/*
- config_name: es
data_files:
- split: train
path: data/eswiki/*
- config_name: et
data_files:
- split: train
path: data/etwiki/*
- config_name: eu
data_files:
- split: train
path: data/euwiki/*
- config_name: ext
data_files:
- split: train
path: data/extwiki/*
- config_name: fat
data_files:
- split: train
path: data/fatwiki/*
- config_name: fa
data_files:
- split: train
path: data/fawiki/*
- config_name: ff
data_files:
- split: train
path: data/ffwiki/*
- config_name: fiu_vro
data_files:
- split: train
path: data/fiu_vrowiki/*
- config_name: fi
data_files:
- split: train
path: data/fiwiki/*
- config_name: fj
data_files:
- split: train
path: data/fjwiki/*
- config_name: fon
data_files:
- split: train
path: data/fonwiki/*
- config_name: fo
data_files:
- split: train
path: data/fowiki/*
- config_name: frp
data_files:
- split: train
path: data/frpwiki/*
- config_name: frr
data_files:
- split: train
path: data/frrwiki/*
- config_name: fr
data_files:
- split: train
path: data/frwiki/*
- config_name: fur
data_files:
- split: train
path: data/furwiki/*
- config_name: fy
data_files:
- split: train
path: data/fywiki/*
- config_name: gag
data_files:
- split: train
path: data/gagwiki/*
- config_name: gan
data_files:
- split: train
path: data/ganwiki/*
- config_name: ga
data_files:
- split: train
path: data/gawiki/*
- config_name: gcr
data_files:
- split: train
path: data/gcrwiki/*
- config_name: gd
data_files:
- split: train
path: data/gdwiki/*
- config_name: glk
data_files:
- split: train
path: data/glkwiki/*
- config_name: gl
data_files:
- split: train
path: data/glwiki/*
- config_name: gn
data_files:
- split: train
path: data/gnwiki/*
- config_name: gom
data_files:
- split: train
path: data/gomwiki/*
- config_name: gor
data_files:
- split: train
path: data/gorwiki/*
- config_name: got
data_files:
- split: train
path: data/gotwiki/*
- config_name: gpe
data_files:
- split: train
path: data/gpewiki/*
- config_name: guc
data_files:
- split: train
path: data/gucwiki/*
- config_name: gur
data_files:
- split: train
path: data/gurwiki/*
- config_name: gu
data_files:
- split: train
path: data/guwiki/*
- config_name: guw
data_files:
- split: train
path: data/guwwiki/*
- config_name: gv
data_files:
- split: train
path: data/gvwiki/*
- config_name: hak
data_files:
- split: train
path: data/hakwiki/*
- config_name: ha
data_files:
- split: train
path: data/hawiki/*
- config_name: haw
data_files:
- split: train
path: data/hawwiki/*
- config_name: he
data_files:
- split: train
path: data/hewiki/*
- config_name: hif
data_files:
- split: train
path: data/hifwiki/*
- config_name: hi
data_files:
- split: train
path: data/hiwiki/*
- config_name: hr
data_files:
- split: train
path: data/hrwiki/*
- config_name: hsb
data_files:
- split: train
path: data/hsbwiki/*
- config_name: ht
data_files:
- split: train
path: data/htwiki/*
- config_name: hu
data_files:
- split: train
path: data/huwiki/*
- config_name: hy
data_files:
- split: train
path: data/hywiki/*
- config_name: hyw
data_files:
- split: train
path: data/hywwiki/*
- config_name: ia
data_files:
- split: train
path: data/iawiki/*
- config_name: id
data_files:
- split: train
path: data/idwiki/*
- config_name: ie
data_files:
- split: train
path: data/iewiki/*
- config_name: ig
data_files:
- split: train
path: data/igwiki/*
- config_name: ik
data_files:
- split: train
path: data/ikwiki/*
- config_name: ilo
data_files:
- split: train
path: data/ilowiki/*
- config_name: inh
data_files:
- split: train
path: data/inhwiki/*
- config_name: io
data_files:
- split: train
path: data/iowiki/*
- config_name: is
data_files:
- split: train
path: data/iswiki/*
- config_name: it
data_files:
- split: train
path: data/itwiki/*
- config_name: iu
data_files:
- split: train
path: data/iuwiki/*
- config_name: jam
data_files:
- split: train
path: data/jamwiki/*
- config_name: ja
data_files:
- split: train
path: data/jawiki/*
- config_name: jbo
data_files:
- split: train
path: data/jbowiki/*
- config_name: jv
data_files:
- split: train
path: data/jvwiki/*
- config_name: kaa
data_files:
- split: train
path: data/kaawiki/*
- config_name: kab
data_files:
- split: train
path: data/kabwiki/*
- config_name: ka
data_files:
- split: train
path: data/kawiki/*
- config_name: kbd
data_files:
- split: train
path: data/kbdwiki/*
- config_name: kbp
data_files:
- split: train
path: data/kbpwiki/*
- config_name: kcg
data_files:
- split: train
path: data/kcgwiki/*
- config_name: kg
data_files:
- split: train
path: data/kgwiki/*
- config_name: ki
data_files:
- split: train
path: data/kiwiki/*
- config_name: kk
data_files:
- split: train
path: data/kkwiki/*
- config_name: kl
data_files:
- split: train
path: data/klwiki/*
- config_name: km
data_files:
- split: train
path: data/kmwiki/*
- config_name: kn
data_files:
- split: train
path: data/knwiki/*
- config_name: koi
data_files:
- split: train
path: data/koiwiki/*
- config_name: ko
data_files:
- split: train
path: data/kowiki/*
- config_name: krc
data_files:
- split: train
path: data/krcwiki/*
- config_name: ksh
data_files:
- split: train
path: data/kshwiki/*
- config_name: ks
data_files:
- split: train
path: data/kswiki/*
- config_name: ku
data_files:
- split: train
path: data/kuwiki/*
- config_name: kv
data_files:
- split: train
path: data/kvwiki/*
- config_name: kw
data_files:
- split: train
path: data/kwwiki/*
- config_name: ky
data_files:
- split: train
path: data/kywiki/*
- config_name: lad
data_files:
- split: train
path: data/ladwiki/*
- config_name: la
data_files:
- split: train
path: data/lawiki/*
- config_name: lbe
data_files:
- split: train
path: data/lbewiki/*
- config_name: lb
data_files:
- split: train
path: data/lbwiki/*
- config_name: lez
data_files:
- split: train
path: data/lezwiki/*
- config_name: lfn
data_files:
- split: train
path: data/lfnwiki/*
- config_name: lg
data_files:
- split: train
path: data/lgwiki/*
- config_name: lij
data_files:
- split: train
path: data/lijwiki/*
- config_name: li
data_files:
- split: train
path: data/liwiki/*
- config_name: lld
data_files:
- split: train
path: data/lldwiki/*
- config_name: lmo
data_files:
- split: train
path: data/lmowiki/*
- config_name: ln
data_files:
- split: train
path: data/lnwiki/*
- config_name: lo
data_files:
- split: train
path: data/lowiki/*
- config_name: ltg
data_files:
- split: train
path: data/ltgwiki/*
- config_name: lt
data_files:
- split: train
path: data/ltwiki/*
- config_name: lv
data_files:
- split: train
path: data/lvwiki/*
- config_name: mad
data_files:
- split: train
path: data/madwiki/*
- config_name: mai
data_files:
- split: train
path: data/maiwiki/*
- config_name: map_bms
data_files:
- split: train
path: data/map_bmswiki/*
- config_name: mdf
data_files:
- split: train
path: data/mdfwiki/*
- config_name: mg
data_files:
- split: train
path: data/mgwiki/*
- config_name: mhr
data_files:
- split: train
path: data/mhrwiki/*
- config_name: min
data_files:
- split: train
path: data/minwiki/*
- config_name: mi
data_files:
- split: train
path: data/miwiki/*
- config_name: mk
data_files:
- split: train
path: data/mkwiki/*
- config_name: ml
data_files:
- split: train
path: data/mlwiki/*
- config_name: mni
data_files:
- split: train
path: data/mniwiki/*
- config_name: mn
data_files:
- split: train
path: data/mnwiki/*
- config_name: mnw
data_files:
- split: train
path: data/mnwwiki/*
- config_name: mrj
data_files:
- split: train
path: data/mrjwiki/*
- config_name: mr
data_files:
- split: train
path: data/mrwiki/*
- config_name: ms
data_files:
- split: train
path: data/mswiki/*
- config_name: mt
data_files:
- split: train
path: data/mtwiki/*
- config_name: mwl
data_files:
- split: train
path: data/mwlwiki/*
- config_name: myv
data_files:
- split: train
path: data/myvwiki/*
- config_name: my
data_files:
- split: train
path: data/mywiki/*
- config_name: mzn
data_files:
- split: train
path: data/mznwiki/*
- config_name: nah
data_files:
- split: train
path: data/nahwiki/*
- config_name: nap
data_files:
- split: train
path: data/napwiki/*
- config_name: nds_nl
data_files:
- split: train
path: data/nds_nlwiki/*
- config_name: nds
data_files:
- split: train
path: data/ndswiki/*
- config_name: ne
data_files:
- split: train
path: data/newiki/*
- config_name: new
data_files:
- split: train
path: data/newwiki/*
- config_name: nia
data_files:
- split: train
path: data/niawiki/*
- config_name: nl
data_files:
- split: train
path: data/nlwiki/*
- config_name: nn
data_files:
- split: train
path: data/nnwiki/*
- config_name: nov
data_files:
- split: train
path: data/novwiki/*
- config_name: "no"
data_files:
- split: train
path: data/nowiki/*
- config_name: nqo
data_files:
- split: train
path: data/nqowiki/*
- config_name: nrm
data_files:
- split: train
path: data/nrmwiki/*
- config_name: nso
data_files:
- split: train
path: data/nsowiki/*
- config_name: nv
data_files:
- split: train
path: data/nvwiki/*
- config_name: ny
data_files:
- split: train
path: data/nywiki/*
- config_name: oc
data_files:
- split: train
path: data/ocwiki/*
- config_name: olo
data_files:
- split: train
path: data/olowiki/*
- config_name: om
data_files:
- split: train
path: data/omwiki/*
- config_name: or
data_files:
- split: train
path: data/orwiki/*
- config_name: os
data_files:
- split: train
path: data/oswiki/*
- config_name: pag
data_files:
- split: train
path: data/pagwiki/*
- config_name: pam
data_files:
- split: train
path: data/pamwiki/*
- config_name: pap
data_files:
- split: train
path: data/papwiki/*
- config_name: pa
data_files:
- split: train
path: data/pawiki/*
- config_name: pcd
data_files:
- split: train
path: data/pcdwiki/*
- config_name: pcm
data_files:
- split: train
path: data/pcmwiki/*
- config_name: pdc
data_files:
- split: train
path: data/pdcwiki/*
- config_name: pfl
data_files:
- split: train
path: data/pflwiki/*
- config_name: pih
data_files:
- split: train
path: data/pihwiki/*
- config_name: pi
data_files:
- split: train
path: data/piwiki/*
- config_name: pl
data_files:
- split: train
path: data/plwiki/*
- config_name: pms
data_files:
- split: train
path: data/pmswiki/*
- config_name: pnb
data_files:
- split: train
path: data/pnbwiki/*
- config_name: pnt
data_files:
- split: train
path: data/pntwiki/*
- config_name: ps
data_files:
- split: train
path: data/pswiki/*
- config_name: pt
data_files:
- split: train
path: data/ptwiki/*
- config_name: pwn
data_files:
- split: train
path: data/pwnwiki/*
- config_name: qu
data_files:
- split: train
path: data/quwiki/*
- config_name: rm
data_files:
- split: train
path: data/rmwiki/*
- config_name: rmy
data_files:
- split: train
path: data/rmywiki/*
- config_name: rn
data_files:
- split: train
path: data/rnwiki/*
- config_name: roa_rup
data_files:
- split: train
path: data/roa_rupwiki/*
- config_name: roa_tara
data_files:
- split: train
path: data/roa_tarawiki/*
- config_name: ro
data_files:
- split: train
path: data/rowiki/*
- config_name: rue
data_files:
- split: train
path: data/ruewiki/*
- config_name: ru
data_files:
- split: train
path: data/ruwiki/*
- config_name: rw
data_files:
- split: train
path: data/rwwiki/*
- config_name: sah
data_files:
- split: train
path: data/sahwiki/*
- config_name: sat
data_files:
- split: train
path: data/satwiki/*
- config_name: sa
data_files:
- split: train
path: data/sawiki/*
- config_name: scn
data_files:
- split: train
path: data/scnwiki/*
- config_name: sco
data_files:
- split: train
path: data/scowiki/*
- config_name: sc
data_files:
- split: train
path: data/scwiki/*
- config_name: sd
data_files:
- split: train
path: data/sdwiki/*
- config_name: se
data_files:
- split: train
path: data/sewiki/*
- config_name: sg
data_files:
- split: train
path: data/sgwiki/*
- config_name: shi
data_files:
- split: train
path: data/shiwiki/*
- config_name: shn
data_files:
- split: train
path: data/shnwiki/*
- config_name: sh
data_files:
- split: train
path: data/shwiki/*
- config_name: simple
data_files:
- split: train
path: data/simplewiki/*
- config_name: si
data_files:
- split: train
path: data/siwiki/*
- config_name: skr
data_files:
- split: train
path: data/skrwiki/*
- config_name: sk
data_files:
- split: train
path: data/skwiki/*
- config_name: sl
data_files:
- split: train
path: data/slwiki/*
- config_name: smn
data_files:
- split: train
path: data/smnwiki/*
- config_name: sm
data_files:
- split: train
path: data/smwiki/*
- config_name: sn
data_files:
- split: train
path: data/snwiki/*
- config_name: so
data_files:
- split: train
path: data/sowiki/*
- config_name: sq
data_files:
- split: train
path: data/sqwiki/*
- config_name: srn
data_files:
- split: train
path: data/srnwiki/*
- config_name: sr
data_files:
- split: train
path: data/srwiki/*
- config_name: ss
data_files:
- split: train
path: data/sswiki/*
- config_name: stq
data_files:
- split: train
path: data/stqwiki/*
- config_name: st
data_files:
- split: train
path: data/stwiki/*
- config_name: su
data_files:
- split: train
path: data/suwiki/*
- config_name: sv
data_files:
- split: train
path: data/svwiki/*
- config_name: sw
data_files:
- split: train
path: data/swwiki/*
- config_name: szl
data_files:
- split: train
path: data/szlwiki/*
- config_name: szy
data_files:
- split: train
path: data/szywiki/*
- config_name: ta
data_files:
- split: train
path: data/tawiki/*
- config_name: tay
data_files:
- split: train
path: data/taywiki/*
- config_name: tcy
data_files:
- split: train
path: data/tcywiki/*
- config_name: tet
data_files:
- split: train
path: data/tetwiki/*
- config_name: te
data_files:
- split: train
path: data/tewiki/*
- config_name: tg
data_files:
- split: train
path: data/tgwiki/*
- config_name: th
data_files:
- split: train
path: data/thwiki/*
- config_name: ti
data_files:
- split: train
path: data/tiwiki/*
- config_name: tk
data_files:
- split: train
path: data/tkwiki/*
- config_name: tl
data_files:
- split: train
path: data/tlwiki/*
- config_name: tly
data_files:
- split: train
path: data/tlywiki/*
- config_name: tn
data_files:
- split: train
path: data/tnwiki/*
- config_name: to
data_files:
- split: train
path: data/towiki/*
- config_name: tpi
data_files:
- split: train
path: data/tpiwiki/*
- config_name: trv
data_files:
- split: train
path: data/trvwiki/*
- config_name: tr
data_files:
- split: train
path: data/trwiki/*
- config_name: ts
data_files:
- split: train
path: data/tswiki/*
- config_name: tt
data_files:
- split: train
path: data/ttwiki/*
- config_name: tum
data_files:
- split: train
path: data/tumwiki/*
- config_name: tw
data_files:
- split: train
path: data/twwiki/*
- config_name: tyv
data_files:
- split: train
path: data/tyvwiki/*
- config_name: ty
data_files:
- split: train
path: data/tywiki/*
- config_name: udm
data_files:
- split: train
path: data/udmwiki/*
- config_name: ug
data_files:
- split: train
path: data/ugwiki/*
- config_name: uk
data_files:
- split: train
path: data/ukwiki/*
- config_name: ur
data_files:
- split: train
path: data/urwiki/*
- config_name: uz
data_files:
- split: train
path: data/uzwiki/*
- config_name: vec
data_files:
- split: train
path: data/vecwiki/*
- config_name: vep
data_files:
- split: train
path: data/vepwiki/*
- config_name: ve
data_files:
- split: train
path: data/vewiki/*
- config_name: vi
data_files:
- split: train
path: data/viwiki/*
- config_name: vls
data_files:
- split: train
path: data/vlswiki/*
- config_name: vo
data_files:
- split: train
path: data/vowiki/*
- config_name: war
data_files:
- split: train
path: data/warwiki/*
- config_name: wa
data_files:
- split: train
path: data/wawiki/*
- config_name: wo
data_files:
- split: train
path: data/wowiki/*
- config_name: wuu
data_files:
- split: train
path: data/wuuwiki/*
- config_name: xal
data_files:
- split: train
path: data/xalwiki/*
- config_name: xh
data_files:
- split: train
path: data/xhwiki/*
- config_name: xmf
data_files:
- split: train
path: data/xmfwiki/*
- config_name: yi
data_files:
- split: train
path: data/yiwiki/*
- config_name: yo
data_files:
- split: train
path: data/yowiki/*
- config_name: za
data_files:
- split: train
path: data/zawiki/*
- config_name: zea
data_files:
- split: train
path: data/zeawiki/*
- config_name: zgh
data_files:
- split: train
path: data/zghwiki/*
- config_name: zh_classical
data_files:
- split: train
path: data/zh_classicalwiki/*
- config_name: zh_min_nan
data_files:
- split: train
path: data/zh_min_nanwiki/*
- config_name: zh_yue
data_files:
- split: train
path: data/zh_yuewiki/*
- config_name: zh
data_files:
- split: train
path: data/zhwiki/*
- config_name: zu
data_files:
- split: train
path: data/zuwiki/*
---

This is an **updated and better extracted** version of the `wikimedia/Wikipedia` dataset originally released in 2023. We carefully parsed [Wikipedia HTML dumps](https://dumps.wikimedia.org/other/enterprise_html/) from *August of 2025* covering 325 languages.
***This dataset:***
- [**fully renders templates**](https://huggingface.co/datasets/wikimedia/wikipedia/discussions/51) as it was extracted from HTML and not markdown dumps
- **removes** redirects, disambiguation pages, and other non-main-article pages
- includes **detailed metadata** such as page ID, title, last modified date, Wikidata ID, version, and a markdown version of the text
- preserves elements and formatting such as **headings, lists, code/pre blocks, tables and math content**
- notably, `wikimedia/Wikipedia` removes all **tables and math content**
- **excludes** most of the "References", "See also", "Notes", "External links", and similar **citations/notes sections** across all languages
- besides keeping all math content, pages containing math are flagged with a **`has_math`** metadata attribute
- **extracts infoboxes** (the high-level summary boxes on the right of some Wikipedia pages) in a **structured format** into the metadata, for RAG and other uses
- only keeps pages whose **script (writing alphabet) matches** the expected list for that language
- for non-English wikis, any page fully or mostly in **English is removed** (a common issue when training language identifiers/classifiers)
## Visualize and Compare
You can explore the dataset, compare it to `wikimedia/Wikipedia` and preview the live Wikipedia pages on our [space](https://huggingface.co/spaces/HuggingFaceFW/finewiki-viewer).
## Available subsets
| Subset | Name | Size | Pages |
|--------|------|------:|-------:|
| `en` | [English](https://en.wikipedia.org) | 35.1 GB | 6,614,655 |
| `de` | [German](https://de.wikipedia.org) | 13.1 GB | 2,713,646 |
| `fr` | [French](https://fr.wikipedia.org) | 12.1 GB | 2,566,183 |
| `ru` | [Russian](https://ru.wikipedia.org) | 10.7 GB | 1,817,813 |
| `ja` | [Japanese](https://ja.wikipedia.org) | 9.9 GB | 1,354,269 |
| `es` | [Spanish](https://es.wikipedia.org) | 8.5 GB | 1,948,965 |
| `it` | [Italian](https://it.wikipedia.org) | 7.4 GB | 1,799,759 |
| `uk` | [Ukrainian](https://uk.wikipedia.org) | 5.4 GB | 1,239,253 |
| `zh` | [Chinese (written vernacular Chinese)](https://zh.wikipedia.org) | 5.1 GB | 1,295,955 |
| `pl` | [Polish](https://pl.wikipedia.org) | 4.4 GB | 1,543,918 |
| `ceb` | [Cebuano](https://ceb.wikipedia.org) | 4.4 GB | 5,647,436 |
| `pt` | [Portuguese](https://pt.wikipedia.org) | 4.3 GB | 1,135,383 |
| `nl` | [Dutch](https://nl.wikipedia.org) | 3.5 GB | 2,072,865 |
| `ca` | [Catalan](https://ca.wikipedia.org) | 3.5 GB | 962,290 |
| `ar` | [Arabic](https://ar.wikipedia.org) | 3.4 GB | 1,230,456 |
| `sv` | [Swedish](https://sv.wikipedia.org) | 2.9 GB | 2,470,063 |
| `cs` | [Czech](https://cs.wikipedia.org) | 2.2 GB | 534,563 |
| `fa` | [Persian](https://fa.wikipedia.org) | 2.2 GB | 1,021,336 |
| `vi` | [Vietnamese](https://vi.wikipedia.org) | 2.1 GB | 1,279,087 |
| `hu` | [Hungarian](https://hu.wikipedia.org) | 2.1 GB | 515,004 |
| `ko` | [Korean](https://ko.wikipedia.org) | 2.0 GB | 582,035 |
| `he` | [Hebrew](https://he.wikipedia.org) | 2.0 GB | 372,053 |
| `sr` | [Serbian](https://sr.wikipedia.org) | 2.0 GB | 664,345 |
| `id` | [Indonesian](https://id.wikipedia.org) | 1.8 GB | 723,099 |
| `tr` | [Turkish](https://tr.wikipedia.org) | 1.6 GB | 629,762 |
| `fi` | [Finnish](https://fi.wikipedia.org) | 1.5 GB | 572,900 |
| `no` | [Norwegian (Bokmål)](https://no.wikipedia.org) | 1.3 GB | 620,802 |
| `el` | [Greek](https://el.wikipedia.org) | 1.2 GB | 242,517 |
| `hy` | [Armenian](https://hy.wikipedia.org) | 1.2 GB | 309,820 |
| `ro` | [Romanian](https://ro.wikipedia.org) | 1.2 GB | 493,462 |
| ... | | | |
| **Total** | | **184.7 GB** | 61,550,610|
A detailed list is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/blob/main/language_subsets.csv).
## How to download and use 🌐 FineWiki
See the table above for the `subset` of the language you want to download.
We currently do not provide smaller `sample` versions, but by setting `limit` or using `streaming=True` you can easily fetch a sample of the data. If there is interest from the community we might upload smaller sampled versions later on.
### Using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/)
```python
from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000)
for document in data_reader():
    # do something with document
    print(document)

###############################
# OR for a processing pipeline:
###############################

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
```
### Using `huggingface_hub`
```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    # download the English subset
    allow_patterns=["data/enwiki/*"],
)
```
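To download several subsets at once, the glob patterns can be derived from the config names. The sketch below assumes the `data/{config_name}wiki/*` layout shown in the config list at the top of this card; `subset_patterns` is a hypothetical helper, not part of `huggingface_hub`.

```python
def subset_patterns(configs):
    """Map config names (e.g. "pt", "zh_yue") to repository glob patterns."""
    return [f"data/{config}wiki/*" for config in configs]


# Patterns for the Portuguese and Galician subsets.
patterns = subset_patterns(["pt", "gl"])
print(patterns)
```

The resulting list can be passed as `allow_patterns=patterns` in the `snapshot_download` call above.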
### Using `datasets`
```python
from datasets import load_dataset
# get Spanish data
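# config names are the language codes listed in the card's config list (e.g. "es"), not the "eswiki" folder name
fw = load_dataset("HuggingFaceFW/finewiki", name="es", split="train", streaming=True)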
```
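Since no smaller `sample` versions are provided, one way to fetch a sample is to read only the first few records of a streamed subset. This is a minimal sketch: `take_sample` and `stream_sample` are hypothetical helper names, and the `"es"` config name is taken from the config list at the top of this card.

```python
from itertools import islice


def take_sample(records, n):
    """Return the first n records of any iterable without consuming the rest."""
    return list(islice(records, n))


def stream_sample(config="es", n=100):
    """Stream one language subset and keep only its first n pages."""
    # Imported lazily so take_sample stays usable without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(
        "HuggingFaceFW/finewiki", name=config, split="train", streaming=True
    )
    return take_sample(stream, n)
```

Calling `stream_sample("es", 100)` needs network access; only the files required to yield the first 100 pages should be read, rather than the whole subset.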
## Dataset Structure
### Data Instances
Example from the English subset (values truncated for readability):
```json
{
  "text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n\n## First Formation\nIn May–June 1938, ...",
  "id": "enwiki/32552979",
  "wikiname": "enwiki",
  "page_id": 32552979,
  "title": "10th Tank Corps",
  "url": "https://en.wikipedia.org/wiki/10th_Tank_Corps",
  "date_modified": "2023-07-26T12:32:03Z",
  "in_language": "en",
  "wikidata_id": "Q12061605",
  "bytes_html": 115017,
  "wikitext": "{{short description|Tank corps of the Soviet military}}\n\n{{Infobox military unit...",
  "version": 1167219203,
  "infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]",
  "has_math": false
}
```
### Data Fields
- `text` (string): cleaned, structured article text preserving headings, lists, code/pre blocks, tables and math. Has some markdown formatting (headings, tables, lists)
- `id` (string): dataset‑unique identifier; typically `<wikiname>/<page_id>`
- `wikiname` (string): wiki project name, e.g., `enwiki`, `ptwiki`
- `page_id` (int): MediaWiki page identifier
- `title` (string): article title
- `url` (string): canonical article URL
- `date_modified` (string): ISO‑8601 timestamp of the last page revision
- `in_language` (string): article language code (e.g., `en`, `pt`)
- `wikidata_id` (string|null): Wikidata QID associated with the page
- `bytes_html` (int): size in bytes of the original HTML body
- `wikitext` (string): original wikitext when available
- `version` (int|string): revision/version identifier of the page
- `infoboxes` (string): JSON‑encoded array of extracted infobox objects with title and key‑value data
- `has_math` (bool): whether math content was detected on the page
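
Because `infoboxes` is stored as a JSON-encoded string, it has to be decoded before use. The sketch below uses the truncated example record from above; `parse_infoboxes` and `split_id` are hypothetical helpers, and `split_id` assumes the typical `<wikiname>/<page_id>` form of `id`.

```python
import json


def parse_infoboxes(record):
    """Decode the JSON-encoded `infoboxes` field into a list of dicts."""
    raw = record.get("infoboxes")
    if not raw:
        return []
    return json.loads(raw)


def split_id(record_id):
    """Split an id of the form '<wikiname>/<page_id>' into its two parts."""
    wikiname, _, page_id = record_id.partition("/")
    return wikiname, int(page_id)


# Truncated example record from the "Data Instances" section.
example = {
    "id": "enwiki/32552979",
    "infoboxes": '[{"title": "10th Tank Corps", "data": {"Active": "..."}}]',
}

for box in parse_infoboxes(example):
    print(box["title"], list(box["data"].keys()))
print(split_id(example["id"]))
```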
## Data Processing
The full pipeline processing code is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/tree/main/src). It runs on [datatrove](https://github.com/huggingface/datatrove/). While we tried to offer robust support for most language variants of Wikipedia, the lack of standardization at the HTML level means that for some subsets the extraction might be sub-optimal. If this is the case for the languages you are interested in, we recommend adapting our code to address your specific concerns.
### Downloading
We used the Wikimedia Enterprise HTML dump API (`https://api.enterprise.wikimedia.com/v2/snapshots`) and downloaded main-namespace (NS0) snapshots for the different language versions of Wikipedia. We intentionally relied on pre-rendered HTML over the more commonly used wikitext/markdown dumps: wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wiki pages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. Snapshots from August of 2025 were used. We record rich per‑page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.
### Extraction
We heavily adapted [mwparserfromhtml](https://pypi.org/project/mwparserfromhtml/) to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed reliably (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page‑props). Reference‑like sections, which mostly hold citation lists rather than natural article text (e.g., “References”, “Notes”, “External links”, localized per language), are excluded using a curated heading list and structural cues (reference list containers), so citations/notes are dropped without harming the main body. Visual/navigation boilerplate (ToC, navboxes, messageboxes, authority control, categories) is filtered out, while infoboxes are extracted into the metadata as structured key-value data that can be useful for knowledge search applications. We additionally strive to keep math content (and mark pages containing it with a `has_math` flag) as well as tables, where much of Wikipedia's knowledge is contained.
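
The heading-list idea can be illustrated on already-extracted markdown text. This is only a sketch under stated assumptions, not the released pipeline (which works on HTML and also uses structural cues): `EXCLUDED_HEADINGS` is a hypothetical English-only list and `drop_sections` is a hypothetical helper.

```python
import re

# Hypothetical, English-only heading list; the real list is curated per language.
EXCLUDED_HEADINGS = {"references", "notes", "external links", "see also"}


def drop_sections(markdown_text, excluded=EXCLUDED_HEADINGS):
    """Remove sections (and their subsections) whose heading is in `excluded`."""
    kept = []
    skip_level = None  # heading level of the section currently being skipped
    for line in markdown_text.splitlines():
        match = re.match(r"^(#+)\s+(.*?)\s*$", line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip().lower()
            # A heading at the same or a higher level ends the skipped section.
            if skip_level is not None and level <= skip_level:
                skip_level = None
            if skip_level is None and title in excluded:
                skip_level = level
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept)


page = "# A\nIntro\n## History\nBody\n## References\n1. x\n### Sub\ny\n## Legacy\nz"
print(drop_sections(page))
```

A line-based approach like this would misread `#` comment lines inside code blocks as headings, which is one reason to filter at the HTML level instead.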
### Filtering
One common issue with low-resource language Wikipedias is the high prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language‑ and script‑aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non‑English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross‑language leakage. We also drop ultra‑short pages without infoboxes to avoid low‑signal content.
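
A rough version of the script check can be sketched with the standard library alone. This is an illustration, not the method used in the pipeline: it approximates a character's script by the first word of its Unicode name, and both the helper names and the `min_fraction` threshold are assumptions.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count alphabetic characters per rough script label.

    The label is the first word of the Unicode character name,
    e.g. 'LATIN', 'CYRILLIC', 'GREEK', 'CJK'.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def matches_expected_script(text, expected_scripts, min_fraction=0.5):
    """Keep a page only if enough of its letters are in the expected scripts."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False  # no letters at all: nothing to judge
    in_script = sum(counts[script] for script in expected_scripts)
    return in_script / total >= min_fraction


print(matches_expected_script("Привет, мир", {"CYRILLIC"}))
print(matches_expected_script("Hello world", {"CYRILLIC"}))
```

A script check alone cannot separate English from other Latin-script languages, which is why the text above describes a separate English-detection step for non-English wikis.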
## Licensing Information
This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under GFDL. See Wikipedia’s licensing and Terms of Use: https://dumps.wikimedia.org/legal.html
Our processed release is an adaptation of that text and is licensed under CC BY-SA 4.0.
## Citation Information
```bibtex
@dataset{penedo2025finewiki,
author = {Guilherme Penedo},
title = {FineWiki},
year = {2025},
publisher = {Hugging Face Datasets},
url = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
urldate = {2025-10-20},
note = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}
```
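### Dataset Metadata
License:
- Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- GNU Free Documentation License (GFDL)
Task categories:
- Text generation
Pretty name: 🌐 FineWiki
Configs: 325 language subset configurations, one per Wikipedia language edition. The `train` split of each config is read from `data/{config_name}wiki/*`. The full list of configs is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/blob/main/language_subsets.csv).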

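This dataset is an updated and better-extracted version of the `wikimedia/Wikipedia` dataset released in 2023. We carefully parsed the [Wikimedia HTML dumps](https://dumps.wikimedia.org/other/enterprise_html/) from August 2025, covering 325 languages.
#### This dataset:
- **Fully renders templates**, as it was extracted from HTML and not from wikitext/markdown dumps (see the discussion [here](https://huggingface.co/datasets/wikimedia/wikipedia/discussions/51))
- **Removes** redirects, disambiguation pages, and other non-main-article pages
- Includes **detailed metadata** such as page ID, title, last modified date, Wikidata ID, version, and a Markdown version of the text
- Preserves elements and formatting such as **headings, lists, code/pre blocks, tables and math content**
  - Notably, the original `wikimedia/Wikipedia` dataset removes all **tables and math content**
- **Excludes** most **reference/notes-like sections** ("References", "See also", "Notes", "External links", and similar) across all language editions
- Besides keeping math content, marks pages that contain math with a **`has_math` metadata attribute**
- **Extracts infoboxes** (the summary boxes with high-level information shown on the right of some Wikipedia pages) in a structured format into the metadata, for Retrieval-Augmented Generation (RAG) and other uses
- Only keeps pages whose **script (writing system) matches the list of expected scripts for that language**
- For non-English wikis, removes any page that is fully or mostly in **English** (a common source of cross-language leakage when training language identification classifiers)
#### Visualize and Compare
You can explore the dataset, compare it with `wikimedia/Wikipedia`, and preview live Wikipedia pages in our [space](https://huggingface.co/spaces/HuggingFaceFW/finewiki-viewer).
#### Available Subsets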
| Subset | Language | Size | Pages |
|--------|------|------:|-------:|
| `en` | [English](https://en.wikipedia.org) | 35.1 GB | 6,614,655 |
| `de` | [German](https://de.wikipedia.org) | 13.1 GB | 2,713,646 |
| `fr` | [French](https://fr.wikipedia.org) | 12.1 GB | 2,566,183 |
| `ru` | [Russian](https://ru.wikipedia.org) | 10.7 GB | 1,817,813 |
| `ja` | [Japanese](https://ja.wikipedia.org) | 9.9 GB | 1,354,269 |
| `es` | [Spanish](https://es.wikipedia.org) | 8.5 GB | 1,948,965 |
| `it` | [Italian](https://it.wikipedia.org) | 7.4 GB | 1,799,759 |
| `uk` | [Ukrainian](https://uk.wikipedia.org) | 5.4 GB | 1,239,253 |
| `zh` | [Chinese (written)](https://zh.wikipedia.org) | 5.1 GB | 1,295,955 |
| `pl` | [Polish](https://pl.wikipedia.org) | 4.4 GB | 1,543,918 |
| `ceb` | [Cebuano](https://ceb.wikipedia.org) | 4.4 GB | 5,647,436 |
| `pt` | [Portuguese](https://pt.wikipedia.org) | 4.3 GB | 1,135,383 |
| `nl` | [Dutch](https://nl.wikipedia.org) | 3.5 GB | 2,072,865 |
| `ca` | [Catalan](https://ca.wikipedia.org) | 3.5 GB | 962,290 |
| `ar` | [Arabic](https://ar.wikipedia.org) | 3.4 GB | 1,230,456 |
| `sv` | [Swedish](https://sv.wikipedia.org) | 2.9 GB | 2,470,063 |
| `cs` | [Czech](https://cs.wikipedia.org) | 2.2 GB | 534,563 |
| `fa` | [Persian](https://fa.wikipedia.org) | 2.2 GB | 1,021,336 |
| `vi` | [Vietnamese](https://vi.wikipedia.org) | 2.1 GB | 1,279,087 |
| `hu` | [Hungarian](https://hu.wikipedia.org) | 2.1 GB | 515,004 |
| `ko` | [Korean](https://ko.wikipedia.org) | 2.0 GB | 582,035 |
| `he` | [Hebrew](https://he.wikipedia.org) | 2.0 GB | 372,053 |
| `sr` | [Serbian](https://sr.wikipedia.org) | 2.0 GB | 664,345 |
| `id` | [Indonesian](https://id.wikipedia.org) | 1.8 GB | 723,099 |
| `tr` | [Turkish](https://tr.wikipedia.org) | 1.6 GB | 629,762 |
| `fi` | [Finnish](https://fi.wikipedia.org) | 1.5 GB | 572,900 |
| `no` | [Norwegian (Bokmål)](https://no.wikipedia.org) | 1.3 GB | 620,802 |
| `el` | [Greek](https://el.wikipedia.org) | 1.2 GB | 242,517 |
| `hy` | [Armenian](https://hy.wikipedia.org) | 1.2 GB | 309,820 |
| `ro` | [Romanian](https://ro.wikipedia.org) | 1.2 GB | 493,462 |
| ... | | | |
| **Total** | | **184.7 GB** | 61,550,610 |
The full list of subsets is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/blob/main/language_subsets.csv).
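#### Downloading and Using 🌐 FineWiki
Choose the language subset you need from the table above. We do not currently provide a smaller sampled version, but you can easily get a sample of the data by setting `limit` or by using `streaming=True`. If there is interest from the community, we may upload smaller sampled versions later.
##### Using 🏭 [`datatrove`](https://github.com/huggingface/datatrove/)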
```python
from datatrove.pipeline.readers import ParquetReader

# limit determines how many documents will be streamed (remove for all)
# this will fetch the Portuguese data
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000)
for document in data_reader():
    # do something with document
    print(document)

###############################
# OR for a processing pipeline:
###############################

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import ParquetReader
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.writers import JsonlWriter

pipeline_exec = LocalPipelineExecutor(
    pipeline=[
        ParquetReader("hf://datasets/HuggingFaceFW/finewiki/data/ptwiki", limit=1000),
        LambdaFilter(lambda doc: "hugging" in doc.text),
        JsonlWriter("some-output-path")
    ],
    tasks=10
)
pipeline_exec.run()
```
##### Using `huggingface_hub`
```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "HuggingFaceFW/finewiki",
    repo_type="dataset",
    local_dir="./finewiki/",
    # download the English subset
    allow_patterns=["data/enwiki/*"],
)
```
##### Using `datasets`
```python
from datasets import load_dataset

# get Spanish data; config names are language codes ("es"), while files live under data/eswiki/
fw = load_dataset("HuggingFaceFW/finewiki", name="es", split="train", streaming=True)
```
#### Dataset Structure
##### Data Instances
An example from the English subset (values truncated for readability):
```json
{
  "text": "# 10th Tank Corps\nThe 10th Tank Corps was a tank corps of the Red Army, formed twice.\n## First Formation\nIn May–June 1938, ...",
  "id": "enwiki/32552979",
  "wikiname": "enwiki",
  "page_id": 32552979,
  "title": "10th Tank Corps",
  "url": "https://en.wikipedia.org/wiki/10th_Tank_Corps",
  "date_modified": "2023-07-26T12:32:03Z",
  "in_language": "en",
  "wikidata_id": "Q12061605",
  "bytes_html": 115017,
  "wikitext": "{{short description|Tank corps of the Soviet military}}\n{{Infobox military unit...",
  "version": 1167219203,
  "infoboxes": "[{\"title\": \"10th Tank Corps\", \"data\": {\"Active\": \"...\"}}]",
  "has_math": false
}
```
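##### Data Fields
- `text` (string): cleaned, structured article text that preserves headings, lists, code/pre blocks, tables and math content, with basic Markdown formatting (headings, tables, lists)
- `id` (string): dataset-wide unique identifier, typically `<wikiname>/<page_id>`
- `wikiname` (string): wiki project name, e.g. `enwiki`, `ptwiki`
- `page_id` (int): MediaWiki page identifier
- `title` (string): article title
- `url` (string): canonical article URL
- `date_modified` (string): ISO-8601 timestamp of the page's latest revision
- `in_language` (string): article language code (e.g. `en`, `pt`)
- `wikidata_id` (string/null): Wikidata QID associated with the article
- `bytes_html` (int): size of the original HTML body in bytes
- `wikitext` (string): original wikitext, when available
- `version` (int/string): revision/version identifier of the page
- `infoboxes` (string): JSON-encoded array of extracted infoboxes, each with a title and key-value data
- `has_math` (bool): whether the page contains math content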
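
Because `infoboxes` is stored as a JSON-encoded string rather than a nested object, it has to be decoded before use. The sketch below shows one way to do that, assuming the shape seen in the example instance (a list of objects with `title` and `data` keys). `infobox_fields` is a hypothetical helper name, and the record is a hand-built stand-in rather than real dataset output.

```python
import json


def infobox_fields(record: dict) -> dict:
    """Merge the key-value data of all infoboxes in a record into one dict."""
    raw = record.get("infoboxes") or "[]"  # treat a missing or empty field as no infoboxes
    merged = {}
    for box in json.loads(raw):
        for key, value in (box.get("data") or {}).items():
            # Keep the first value seen when several infoboxes share a key.
            merged.setdefault(key, value)
    return merged


# Hand-built record mirroring the example instance above.
record = {
    "title": "10th Tank Corps",
    "infoboxes": json.dumps([{"title": "10th Tank Corps", "data": {"Active": "..."}}]),
    "has_math": False,
}
print(infobox_fields(record))
```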
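#### Data Processing
The full pipeline code is available [here](https://huggingface.co/datasets/HuggingFaceFW/finewiki/tree/main/src) and is built on [datatrove](https://github.com/huggingface/datatrove/). While we have done our best to robustly support the vast majority of Wikipedia language variants, the lack of a unified standard at the HTML level means that extraction may be less than ideal for some language subsets. If that is the case for a language you care about, we recommend adapting our code to your specific needs.
##### Download
We downloaded main-namespace (NS0) snapshots of the different language editions of Wikipedia from the Wikimedia Enterprise HTML dumps API (`https://api.enterprise.wikimedia.com/v2/snapshots`). We deliberately chose the pre-rendered HTML over the more commonly used wikitext/markdown dumps: wikitext often encodes templates and formatting as parser functions/macros, which makes large sections of wikipages harder to reconstruct faithfully, whereas the Enterprise HTML already expands those structures. Snapshots from August of 2025 were used. We record rich per-page attributes (IDs, titles, URLs, language, version, timestamps, Wikidata IDs) as part of the metadata.
##### Extraction
We heavily adapted [mwparserfromhtml](https://pypi.org/project/mwparserfromhtml/) to parse the HTML content into a clean, structured text representation that preserves meaningful formatting. Redirect and disambiguation pages are removed (via redirect markers in wikitext/HTML and disambiguation signals, including Wikidata IDs and page-props). Reference-like sections that hold non-article content (e.g., "References", "Notes", "External links", localized per language) are excluded using a curated heading list and structural cues (reference list containers), so citations/notes are dropped without harming the main body. Visual/navigation boilerplate (ToC, navboxes, messageboxes, authority control, categories) is filtered out, while infoboxes are extracted into the metadata as structured key-value data that can be useful for knowledge search applications. We additionally strive to keep math content (and mark pages containing it with a `has_math` flag) as well as tables, where much of Wikipedia's knowledge is contained.
##### Filtering
One common issue with low-resource language Wikipedias is the high prevalence of content from other languages, particularly English (often from articles or boilerplate pages copied over from the English Wikipedia). To ensure language quality and consistency, we apply language- and script-aware checks tailored to each wiki. Pages are kept only if their predicted writing system matches the expected scripts for that language. For non-English wikis, pages that are predominantly English above a confidence threshold are removed to reduce cross-language leakage. We also drop ultra-short pages without infoboxes to avoid low-signal content.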
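The last rule, dropping ultra-short pages that have no infobox, can be sketched on top of the record fields documented above. This is only an illustration: the `MIN_CHARS` cutoff and the `is_low_signal` helper are hypothetical, and the actual threshold and length measure used by the pipeline are not stated here.

```python
import json

MIN_CHARS = 200  # hypothetical cutoff, not the value used by the authors


def is_low_signal(record: dict, min_chars: int = MIN_CHARS) -> bool:
    """True for a very short page that also has no extracted infobox."""
    infoboxes = json.loads(record.get("infoboxes") or "[]")
    text = record.get("text") or ""
    return len(text) < min_chars and not infoboxes


# Hand-built records for illustration only.
stub = {"text": "# Foo\nFoo is a village.", "infoboxes": "[]"}
stub_with_box = {
    "text": "# Foo\nFoo is a village.",
    "infoboxes": json.dumps([{"title": "Foo", "data": {"Country": "..."}}]),
}
print(is_low_signal(stub), is_low_signal(stub_with_box))
```

Short pages with an infobox are kept because the infobox itself carries structured information even when the body text is minimal.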
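#### Licensing Information
This dataset contains text from Wikipedia, licensed under Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0) and also available under the GNU Free Documentation License (GFDL). See Wikipedia's licensing and Terms of Use: https://dumps.wikimedia.org/legal.html. Our processed release is an adaptation of that text and is likewise licensed under CC BY-SA 4.0.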
#### Citation Information
```bibtex
@dataset{penedo2025finewiki,
  author = {Guilherme Penedo},
  title = {FineWiki},
  year = {2025},
  publisher = {Hugging Face Datasets},
  url = {https://huggingface.co/datasets/HuggingFaceFW/finewiki},
  urldate = {2025-10-20},
  note = {Source: Wikimedia Enterprise Snapshot API (https://api.enterprise.wikimedia.com/v2/snapshots). Text licensed under CC BY-SA 4.0 with attribution to Wikipedia contributors.}
}
```
Provider: EliMC



