open-index/fineweb-2-nlp

Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/open-index/fineweb-2-nlp
Resource description:
--- license: odc-by task_categories: - text-generation - feature-extraction - text-classification language: - ar - gu - hi - kn - ml - mr - pa - ro - ta - te - ur - vi pretty_name: "FineWeb-2 NLP" size_categories: - 100M<n<1B tags: - parquet - fineweb-2 - nlp - sentences - paragraphs - words - ngrams - multilingual configs: - config_name: sentences data_files: - split: train path: data/sentences/**/*.parquet - config_name: paragraphs data_files: - split: train path: data/paragraphs/**/*.parquet - config_name: words data_files: - split: train path: data/words/**/*.parquet - config_name: ngrams data_files: - split: train path: data/ngrams/**/*.parquet - config_name: sentences-vie_Latn data_files: - split: train path: data/sentences/vie_Latn/*.parquet - config_name: paragraphs-vie_Latn data_files: - split: train path: data/paragraphs/vie_Latn/*.parquet - config_name: words-vie_Latn data_files: - split: train path: data/words/vie_Latn/*.parquet - config_name: ngrams-vie_Latn data_files: - split: train path: data/ngrams/vie_Latn/*.parquet - config_name: sentences-ars_Arab data_files: - split: train path: data/sentences/ars_Arab/*.parquet - config_name: paragraphs-ars_Arab data_files: - split: train path: data/paragraphs/ars_Arab/*.parquet - config_name: words-ars_Arab data_files: - split: train path: data/words/ars_Arab/*.parquet - config_name: ngrams-ars_Arab data_files: - split: train path: data/ngrams/ars_Arab/*.parquet - config_name: sentences-amh_Ethi data_files: - split: train path: data/sentences/amh_Ethi/*.parquet - config_name: paragraphs-amh_Ethi data_files: - split: train path: data/paragraphs/amh_Ethi/*.parquet - config_name: words-amh_Ethi data_files: - split: train path: data/words/amh_Ethi/*.parquet - config_name: ngrams-amh_Ethi data_files: - split: train path: data/ngrams/amh_Ethi/*.parquet - config_name: sentences-epo_Latn data_files: - split: train path: data/sentences/epo_Latn/*.parquet - config_name: paragraphs-epo_Latn data_files: - split: train 
path: data/paragraphs/epo_Latn/*.parquet - config_name: words-epo_Latn data_files: - split: train path: data/words/epo_Latn/*.parquet - config_name: ngrams-epo_Latn data_files: - split: train path: data/ngrams/epo_Latn/*.parquet - config_name: sentences-tat_Cyrl data_files: - split: train path: data/sentences/tat_Cyrl/*.parquet - config_name: paragraphs-tat_Cyrl data_files: - split: train path: data/paragraphs/tat_Cyrl/*.parquet - config_name: words-tat_Cyrl data_files: - split: train path: data/words/tat_Cyrl/*.parquet - config_name: ngrams-tat_Cyrl data_files: - split: train path: data/ngrams/tat_Cyrl/*.parquet - config_name: sentences-hif_Latn data_files: - split: train path: data/sentences/hif_Latn/*.parquet - config_name: paragraphs-hif_Latn data_files: - split: train path: data/paragraphs/hif_Latn/*.parquet - config_name: words-hif_Latn data_files: - split: train path: data/words/hif_Latn/*.parquet - config_name: ngrams-hif_Latn data_files: - split: train path: data/ngrams/hif_Latn/*.parquet - config_name: sentences-xho_Latn data_files: - split: train path: data/sentences/xho_Latn/*.parquet - config_name: paragraphs-xho_Latn data_files: - split: train path: data/paragraphs/xho_Latn/*.parquet - config_name: words-xho_Latn data_files: - split: train path: data/words/xho_Latn/*.parquet - config_name: ngrams-xho_Latn data_files: - split: train path: data/ngrams/xho_Latn/*.parquet - config_name: sentences-ltz_Latn data_files: - split: train path: data/sentences/ltz_Latn/*.parquet - config_name: paragraphs-ltz_Latn data_files: - split: train path: data/paragraphs/ltz_Latn/*.parquet - config_name: words-ltz_Latn data_files: - split: train path: data/words/ltz_Latn/*.parquet - config_name: ngrams-ltz_Latn data_files: - split: train path: data/ngrams/ltz_Latn/*.parquet - config_name: sentences-gmh_Latn data_files: - split: train path: data/sentences/gmh_Latn/*.parquet - config_name: paragraphs-gmh_Latn data_files: - split: train path: 
data/paragraphs/gmh_Latn/*.parquet - config_name: words-gmh_Latn data_files: - split: train path: data/words/gmh_Latn/*.parquet - config_name: ngrams-gmh_Latn data_files: - split: train path: data/ngrams/gmh_Latn/*.parquet - config_name: sentences-plt_Latn data_files: - split: train path: data/sentences/plt_Latn/*.parquet - config_name: paragraphs-plt_Latn data_files: - split: train path: data/paragraphs/plt_Latn/*.parquet - config_name: words-plt_Latn data_files: - split: train path: data/words/plt_Latn/*.parquet - config_name: ngrams-plt_Latn data_files: - split: train path: data/ngrams/plt_Latn/*.parquet - config_name: sentences-gla_Latn data_files: - split: train path: data/sentences/gla_Latn/*.parquet - config_name: paragraphs-gla_Latn data_files: - split: train path: data/paragraphs/gla_Latn/*.parquet - config_name: words-gla_Latn data_files: - split: train path: data/words/gla_Latn/*.parquet - config_name: ngrams-gla_Latn data_files: - split: train path: data/ngrams/gla_Latn/*.parquet - config_name: sentences-jav_Latn data_files: - split: train path: data/sentences/jav_Latn/*.parquet - config_name: paragraphs-jav_Latn data_files: - split: train path: data/paragraphs/jav_Latn/*.parquet - config_name: words-jav_Latn data_files: - split: train path: data/words/jav_Latn/*.parquet - config_name: ngrams-jav_Latn data_files: - split: train path: data/ngrams/jav_Latn/*.parquet - config_name: sentences-fao_Latn data_files: - split: train path: data/sentences/fao_Latn/*.parquet - config_name: paragraphs-fao_Latn data_files: - split: train path: data/paragraphs/fao_Latn/*.parquet - config_name: words-fao_Latn data_files: - split: train path: data/words/fao_Latn/*.parquet - config_name: ngrams-fao_Latn data_files: - split: train path: data/ngrams/fao_Latn/*.parquet - config_name: sentences-fry_Latn data_files: - split: train path: data/sentences/fry_Latn/*.parquet - config_name: paragraphs-fry_Latn data_files: - split: train path: data/paragraphs/fry_Latn/*.parquet - 
config_name: words-fry_Latn data_files: - split: train path: data/words/fry_Latn/*.parquet - config_name: ngrams-fry_Latn data_files: - split: train path: data/ngrams/fry_Latn/*.parquet - config_name: sentences-yue_Hani data_files: - split: train path: data/sentences/yue_Hani/*.parquet - config_name: paragraphs-yue_Hani data_files: - split: train path: data/paragraphs/yue_Hani/*.parquet - config_name: words-yue_Hani data_files: - split: train path: data/words/yue_Hani/*.parquet - config_name: ngrams-yue_Hani data_files: - split: train path: data/ngrams/yue_Hani/*.parquet - config_name: sentences-hat_Latn data_files: - split: train path: data/sentences/hat_Latn/*.parquet - config_name: paragraphs-hat_Latn data_files: - split: train path: data/paragraphs/hat_Latn/*.parquet - config_name: words-hat_Latn data_files: - split: train path: data/words/hat_Latn/*.parquet - config_name: ngrams-hat_Latn data_files: - split: train path: data/ngrams/hat_Latn/*.parquet - config_name: sentences-tuk_Latn data_files: - split: train path: data/sentences/tuk_Latn/*.parquet - config_name: paragraphs-tuk_Latn data_files: - split: train path: data/paragraphs/tuk_Latn/*.parquet - config_name: words-tuk_Latn data_files: - split: train path: data/words/tuk_Latn/*.parquet - config_name: ngrams-tuk_Latn data_files: - split: train path: data/ngrams/tuk_Latn/*.parquet - config_name: sentences-pap_Latn data_files: - split: train path: data/sentences/pap_Latn/*.parquet - config_name: paragraphs-pap_Latn data_files: - split: train path: data/paragraphs/pap_Latn/*.parquet - config_name: words-pap_Latn data_files: - split: train path: data/words/pap_Latn/*.parquet - config_name: ngrams-pap_Latn data_files: - split: train path: data/ngrams/pap_Latn/*.parquet - config_name: sentences-asm_Beng data_files: - split: train path: data/sentences/asm_Beng/*.parquet - config_name: paragraphs-asm_Beng data_files: - split: train path: data/paragraphs/asm_Beng/*.parquet - config_name: words-asm_Beng data_files: 
- split: train path: data/words/asm_Beng/*.parquet - config_name: ngrams-asm_Beng data_files: - split: train path: data/ngrams/asm_Beng/*.parquet - config_name: sentences-ceb_Latn data_files: - split: train path: data/sentences/ceb_Latn/*.parquet - config_name: paragraphs-ceb_Latn data_files: - split: train path: data/paragraphs/ceb_Latn/*.parquet - config_name: words-ceb_Latn data_files: - split: train path: data/words/ceb_Latn/*.parquet - config_name: ngrams-ceb_Latn data_files: - split: train path: data/ngrams/ceb_Latn/*.parquet - config_name: sentences-lao_Laoo data_files: - split: train path: data/sentences/lao_Laoo/*.parquet - config_name: paragraphs-lao_Laoo data_files: - split: train path: data/paragraphs/lao_Laoo/*.parquet - config_name: words-lao_Laoo data_files: - split: train path: data/words/lao_Laoo/*.parquet - config_name: ngrams-lao_Laoo data_files: - split: train path: data/ngrams/lao_Laoo/*.parquet - config_name: sentences-bak_Cyrl data_files: - split: train path: data/sentences/bak_Cyrl/*.parquet - config_name: paragraphs-bak_Cyrl data_files: - split: train path: data/paragraphs/bak_Cyrl/*.parquet - config_name: words-bak_Cyrl data_files: - split: train path: data/words/bak_Cyrl/*.parquet - config_name: ngrams-bak_Cyrl data_files: - split: train path: data/ngrams/bak_Cyrl/*.parquet - config_name: sentences-kin_Latn data_files: - split: train path: data/sentences/kin_Latn/*.parquet - config_name: paragraphs-kin_Latn data_files: - split: train path: data/paragraphs/kin_Latn/*.parquet - config_name: words-kin_Latn data_files: - split: train path: data/words/kin_Latn/*.parquet - config_name: ngrams-kin_Latn data_files: - split: train path: data/ngrams/kin_Latn/*.parquet - config_name: sentences-mri_Latn data_files: - split: train path: data/sentences/mri_Latn/*.parquet - config_name: paragraphs-mri_Latn data_files: - split: train path: data/paragraphs/mri_Latn/*.parquet - config_name: words-mri_Latn data_files: - split: train path: 
data/words/mri_Latn/*.parquet - config_name: ngrams-mri_Latn data_files: - split: train path: data/ngrams/mri_Latn/*.parquet - config_name: sentences-mww_Latn data_files: - split: train path: data/sentences/mww_Latn/*.parquet - config_name: paragraphs-mww_Latn data_files: - split: train path: data/paragraphs/mww_Latn/*.parquet - config_name: words-mww_Latn data_files: - split: train path: data/words/mww_Latn/*.parquet - config_name: ngrams-mww_Latn data_files: - split: train path: data/ngrams/mww_Latn/*.parquet - config_name: sentences-zul_Latn data_files: - split: train path: data/sentences/zul_Latn/*.parquet - config_name: paragraphs-zul_Latn data_files: - split: train path: data/paragraphs/zul_Latn/*.parquet - config_name: words-zul_Latn data_files: - split: train path: data/words/zul_Latn/*.parquet - config_name: ngrams-zul_Latn data_files: - split: train path: data/ngrams/zul_Latn/*.parquet - config_name: sentences-snd_Arab data_files: - split: train path: data/sentences/snd_Arab/*.parquet - config_name: paragraphs-snd_Arab data_files: - split: train path: data/paragraphs/snd_Arab/*.parquet - config_name: words-snd_Arab data_files: - split: train path: data/words/snd_Arab/*.parquet - config_name: ngrams-snd_Arab data_files: - split: train path: data/ngrams/snd_Arab/*.parquet - config_name: sentences-sun_Latn data_files: - split: train path: data/sentences/sun_Latn/*.parquet - config_name: paragraphs-sun_Latn data_files: - split: train path: data/paragraphs/sun_Latn/*.parquet - config_name: words-sun_Latn data_files: - split: train path: data/words/sun_Latn/*.parquet - config_name: ngrams-sun_Latn data_files: - split: train path: data/ngrams/sun_Latn/*.parquet - config_name: sentences-cos_Latn data_files: - split: train path: data/sentences/cos_Latn/*.parquet - config_name: paragraphs-cos_Latn data_files: - split: train path: data/paragraphs/cos_Latn/*.parquet - config_name: words-cos_Latn data_files: - split: train path: data/words/cos_Latn/*.parquet - 
config_name: ngrams-cos_Latn data_files: - split: train path: data/ngrams/cos_Latn/*.parquet - config_name: sentences-nya_Latn data_files: - split: train path: data/sentences/nya_Latn/*.parquet - config_name: paragraphs-nya_Latn data_files: - split: train path: data/paragraphs/nya_Latn/*.parquet - config_name: words-nya_Latn data_files: - split: train path: data/words/nya_Latn/*.parquet - config_name: ngrams-nya_Latn data_files: - split: train path: data/ngrams/nya_Latn/*.parquet - config_name: sentences-nap_Latn data_files: - split: train path: data/sentences/nap_Latn/*.parquet - config_name: paragraphs-nap_Latn data_files: - split: train path: data/paragraphs/nap_Latn/*.parquet - config_name: words-nap_Latn data_files: - split: train path: data/words/nap_Latn/*.parquet - config_name: ngrams-nap_Latn data_files: - split: train path: data/ngrams/nap_Latn/*.parquet - config_name: sentences-smo_Latn data_files: - split: train path: data/sentences/smo_Latn/*.parquet - config_name: paragraphs-smo_Latn data_files: - split: train path: data/paragraphs/smo_Latn/*.parquet - config_name: words-smo_Latn data_files: - split: train path: data/words/smo_Latn/*.parquet - config_name: ngrams-smo_Latn data_files: - split: train path: data/ngrams/smo_Latn/*.parquet - config_name: sentences-sot_Latn data_files: - split: train path: data/sentences/sot_Latn/*.parquet - config_name: paragraphs-sot_Latn data_files: - split: train path: data/paragraphs/sot_Latn/*.parquet - config_name: words-sot_Latn data_files: - split: train path: data/words/sot_Latn/*.parquet - config_name: ngrams-sot_Latn data_files: - split: train path: data/ngrams/sot_Latn/*.parquet - config_name: sentences-ibo_Latn data_files: - split: train path: data/sentences/ibo_Latn/*.parquet - config_name: paragraphs-ibo_Latn data_files: - split: train path: data/paragraphs/ibo_Latn/*.parquet - config_name: words-ibo_Latn data_files: - split: train path: data/words/ibo_Latn/*.parquet - config_name: ngrams-ibo_Latn 
data_files: - split: train path: data/ngrams/ibo_Latn/*.parquet - config_name: sentences-sna_Latn data_files: - split: train path: data/sentences/sna_Latn/*.parquet - config_name: paragraphs-sna_Latn data_files: - split: train path: data/paragraphs/sna_Latn/*.parquet - config_name: words-sna_Latn data_files: - split: train path: data/words/sna_Latn/*.parquet - config_name: ngrams-sna_Latn data_files: - split: train path: data/ngrams/sna_Latn/*.parquet - config_name: sentences-sah_Cyrl data_files: - split: train path: data/sentences/sah_Cyrl/*.parquet - config_name: paragraphs-sah_Cyrl data_files: - split: train path: data/paragraphs/sah_Cyrl/*.parquet - config_name: words-sah_Cyrl data_files: - split: train path: data/words/sah_Cyrl/*.parquet - config_name: ngrams-sah_Cyrl data_files: - split: train path: data/ngrams/sah_Cyrl/*.parquet - config_name: sentences-hin_Latn data_files: - split: train path: data/sentences/hin_Latn/*.parquet - config_name: paragraphs-hin_Latn data_files: - split: train path: data/paragraphs/hin_Latn/*.parquet - config_name: words-hin_Latn data_files: - split: train path: data/words/hin_Latn/*.parquet - config_name: ngrams-hin_Latn data_files: - split: train path: data/ngrams/hin_Latn/*.parquet - config_name: sentences-oss_Cyrl data_files: - split: train path: data/sentences/oss_Cyrl/*.parquet - config_name: paragraphs-oss_Cyrl data_files: - split: train path: data/paragraphs/oss_Cyrl/*.parquet - config_name: words-oss_Cyrl data_files: - split: train path: data/words/oss_Cyrl/*.parquet - config_name: ngrams-oss_Cyrl data_files: - split: train path: data/ngrams/oss_Cyrl/*.parquet - config_name: sentences-chv_Cyrl data_files: - split: train path: data/sentences/chv_Cyrl/*.parquet - config_name: paragraphs-chv_Cyrl data_files: - split: train path: data/paragraphs/chv_Cyrl/*.parquet - config_name: words-chv_Cyrl data_files: - split: train path: data/words/chv_Cyrl/*.parquet - config_name: ngrams-chv_Cyrl data_files: - split: train path: 
data/ngrams/chv_Cyrl/*.parquet - config_name: sentences-div_Thaa data_files: - split: train path: data/sentences/div_Thaa/*.parquet - config_name: paragraphs-div_Thaa data_files: - split: train path: data/paragraphs/div_Thaa/*.parquet - config_name: words-div_Thaa data_files: - split: train path: data/words/div_Thaa/*.parquet - config_name: ngrams-div_Thaa data_files: - split: train path: data/ngrams/div_Thaa/*.parquet - config_name: sentences-uig_Arab data_files: - split: train path: data/sentences/uig_Arab/*.parquet - config_name: paragraphs-uig_Arab data_files: - split: train path: data/paragraphs/uig_Arab/*.parquet - config_name: words-uig_Arab data_files: - split: train path: data/words/uig_Arab/*.parquet - config_name: ngrams-uig_Arab data_files: - split: train path: data/ngrams/uig_Arab/*.parquet - config_name: sentences-haw_Latn data_files: - split: train path: data/sentences/haw_Latn/*.parquet - config_name: paragraphs-haw_Latn data_files: - split: train path: data/paragraphs/haw_Latn/*.parquet - config_name: words-haw_Latn data_files: - split: train path: data/words/haw_Latn/*.parquet - config_name: ngrams-haw_Latn data_files: - split: train path: data/ngrams/haw_Latn/*.parquet - config_name: sentences-ydd_Hebr data_files: - split: train path: data/sentences/ydd_Hebr/*.parquet - config_name: paragraphs-ydd_Hebr data_files: - split: train path: data/paragraphs/ydd_Hebr/*.parquet - config_name: words-ydd_Hebr data_files: - split: train path: data/words/ydd_Hebr/*.parquet - config_name: ngrams-ydd_Hebr data_files: - split: train path: data/ngrams/ydd_Hebr/*.parquet - config_name: sentences-sme_Latn data_files: - split: train path: data/sentences/sme_Latn/*.parquet - config_name: paragraphs-sme_Latn data_files: - split: train path: data/paragraphs/sme_Latn/*.parquet - config_name: words-sme_Latn data_files: - split: train path: data/words/sme_Latn/*.parquet - config_name: ngrams-sme_Latn data_files: - split: train path: data/ngrams/sme_Latn/*.parquet - 
config_name: sentences-yor_Latn data_files: - split: train path: data/sentences/yor_Latn/*.parquet - config_name: paragraphs-yor_Latn data_files: - split: train path: data/paragraphs/yor_Latn/*.parquet - config_name: words-yor_Latn data_files: - split: train path: data/words/yor_Latn/*.parquet - config_name: ngrams-yor_Latn data_files: - split: train path: data/ngrams/yor_Latn/*.parquet - config_name: sentences-nds_Latn data_files: - split: train path: data/sentences/nds_Latn/*.parquet - config_name: paragraphs-nds_Latn data_files: - split: train path: data/paragraphs/nds_Latn/*.parquet - config_name: words-nds_Latn data_files: - split: train path: data/words/nds_Latn/*.parquet - config_name: ngrams-nds_Latn data_files: - split: train path: data/ngrams/nds_Latn/*.parquet - config_name: sentences-san_Deva data_files: - split: train path: data/sentences/san_Deva/*.parquet - config_name: paragraphs-san_Deva data_files: - split: train path: data/paragraphs/san_Deva/*.parquet - config_name: words-san_Deva data_files: - split: train path: data/words/san_Deva/*.parquet - config_name: ngrams-san_Deva data_files: - split: train path: data/ngrams/san_Deva/*.parquet - config_name: sentences-gsw_Latn data_files: - split: train path: data/sentences/gsw_Latn/*.parquet - config_name: paragraphs-gsw_Latn data_files: - split: train path: data/paragraphs/gsw_Latn/*.parquet - config_name: words-gsw_Latn data_files: - split: train path: data/words/gsw_Latn/*.parquet - config_name: ngrams-gsw_Latn data_files: - split: train path: data/ngrams/gsw_Latn/*.parquet - config_name: sentences-bod_Tibt data_files: - split: train path: data/sentences/bod_Tibt/*.parquet - config_name: paragraphs-bod_Tibt data_files: - split: train path: data/paragraphs/bod_Tibt/*.parquet - config_name: words-bod_Tibt data_files: - split: train path: data/words/bod_Tibt/*.parquet - config_name: ngrams-bod_Tibt data_files: - split: train path: data/ngrams/bod_Tibt/*.parquet - config_name: sentences-hyw_Armn 
data_files: - split: train path: data/sentences/hyw_Armn/*.parquet - config_name: paragraphs-hyw_Armn data_files: - split: train path: data/paragraphs/hyw_Armn/*.parquet - config_name: words-hyw_Armn data_files: - split: train path: data/words/hyw_Armn/*.parquet - config_name: ngrams-hyw_Armn data_files: - split: train path: data/ngrams/hyw_Armn/*.parquet - config_name: sentences-urd_Latn data_files: - split: train path: data/sentences/urd_Latn/*.parquet - config_name: paragraphs-urd_Latn data_files: - split: train path: data/paragraphs/urd_Latn/*.parquet - config_name: words-urd_Latn data_files: - split: train path: data/words/urd_Latn/*.parquet - config_name: ngrams-urd_Latn data_files: - split: train path: data/ngrams/urd_Latn/*.parquet - config_name: sentences-ast_Latn data_files: - split: train path: data/sentences/ast_Latn/*.parquet - config_name: paragraphs-ast_Latn data_files: - split: train path: data/paragraphs/ast_Latn/*.parquet - config_name: words-ast_Latn data_files: - split: train path: data/words/ast_Latn/*.parquet - config_name: ngrams-ast_Latn data_files: - split: train path: data/ngrams/ast_Latn/*.parquet - config_name: sentences-oci_Latn data_files: - split: train path: data/sentences/oci_Latn/*.parquet - config_name: paragraphs-oci_Latn data_files: - split: train path: data/paragraphs/oci_Latn/*.parquet - config_name: words-oci_Latn data_files: - split: train path: data/words/oci_Latn/*.parquet - config_name: ngrams-oci_Latn data_files: - split: train path: data/ngrams/oci_Latn/*.parquet - config_name: sentences-lus_Latn data_files: - split: train path: data/sentences/lus_Latn/*.parquet - config_name: paragraphs-lus_Latn data_files: - split: train path: data/paragraphs/lus_Latn/*.parquet - config_name: words-lus_Latn data_files: - split: train path: data/words/lus_Latn/*.parquet - config_name: ngrams-lus_Latn data_files: - split: train path: data/ngrams/lus_Latn/*.parquet - config_name: sentences-azb_Arab data_files: - split: train path: 
data/sentences/azb_Arab/*.parquet - config_name: paragraphs-azb_Arab data_files: - split: train path: data/paragraphs/azb_Arab/*.parquet - config_name: words-azb_Arab data_files: - split: train path: data/words/azb_Arab/*.parquet - config_name: ngrams-azb_Arab data_files: - split: train path: data/ngrams/azb_Arab/*.parquet - config_name: sentences-apc_Arab data_files: - split: train path: data/sentences/apc_Arab/*.parquet - config_name: paragraphs-apc_Arab data_files: - split: train path: data/paragraphs/apc_Arab/*.parquet - config_name: words-apc_Arab data_files: - split: train path: data/words/apc_Arab/*.parquet - config_name: ngrams-apc_Arab data_files: - split: train path: data/ngrams/apc_Arab/*.parquet - config_name: sentences-hbo_Hebr data_files: - split: train path: data/sentences/hbo_Hebr/*.parquet - config_name: paragraphs-hbo_Hebr data_files: - split: train path: data/paragraphs/hbo_Hebr/*.parquet - config_name: words-hbo_Hebr data_files: - split: train path: data/words/hbo_Hebr/*.parquet - config_name: ngrams-hbo_Hebr data_files: - split: train path: data/ngrams/hbo_Hebr/*.parquet - config_name: sentences-rue_Cyrl data_files: - split: train path: data/sentences/rue_Cyrl/*.parquet - config_name: paragraphs-rue_Cyrl data_files: - split: train path: data/paragraphs/rue_Cyrl/*.parquet - config_name: words-rue_Cyrl data_files: - split: train path: data/words/rue_Cyrl/*.parquet - config_name: ngrams-rue_Cyrl data_files: - split: train path: data/ngrams/rue_Cyrl/*.parquet - config_name: sentences-bar_Latn data_files: - split: train path: data/sentences/bar_Latn/*.parquet - config_name: paragraphs-bar_Latn data_files: - split: train path: data/paragraphs/bar_Latn/*.parquet - config_name: words-bar_Latn data_files: - split: train path: data/words/bar_Latn/*.parquet - config_name: ngrams-bar_Latn data_files: - split: train path: data/ngrams/bar_Latn/*.parquet - config_name: sentences-anp_Deva data_files: - split: train path: data/sentences/anp_Deva/*.parquet - 
config_name: paragraphs-anp_Deva data_files: - split: train path: data/paragraphs/anp_Deva/*.parquet - config_name: words-anp_Deva data_files: - split: train path: data/words/anp_Deva/*.parquet - config_name: ngrams-anp_Deva data_files: - split: train path: data/ngrams/anp_Deva/*.parquet ---

# FineWeb-2 NLP

**494,036,544 sentences** and **3,437,724,582 word tokens** across **1,231 languages**, extracted from **14,400,889 source documents** (23.5 GB source data) in [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). The dataset provides every sentence, paragraph, word frequency, and n-gram frequency, split with language-aware segmentation and continuously updated.

## Table of Contents

- [What is this?](#what-is-this)
- [What is being released?](#what-is-being-released)
- [Data organization](#data-organization)
- [Sentence distribution by language](#sentence-distribution-by-language)
- [Paragraph distribution by language](#paragraph-distribution-by-language)
- [Splitting quality overview](#splitting-quality-overview)
- [How to download and use this dataset](#how-to-download-and-use-this-dataset)
- [Dataset statistics](#dataset-statistics)
- [How it works](#how-it-works)
- [Splitting methodology](#splitting-methodology)
- [Dataset card](#dataset-card)

---

## What is this?

[FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) is Hugging Face's multilingual web text corpus. It contains approximately 5 billion documents totaling 20 TB of text, drawn from roughly 100 Common Crawl snapshots spanning 2013 to 2024, and covering **1,868 language-script pairs**. It is the largest curated multilingual web corpus publicly available today.

Working directly with FineWeb-2 is challenging. The raw data is enormous, and common NLP tasks like sentence extraction, word frequency analysis, or n-gram computation require downloading and processing terabytes of parquet files. Most researchers need just one language, or just the sentences, or just the word frequencies.
They should not have to process the entire corpus to get there.

**FineWeb-2 NLP** solves this by pre-segmenting every document in FineWeb-2 into four linguistically useful units:

| Type | Rows | What you get |
|------|------|-------------|
| **sentences** | 494,036,544 | One row per sentence, with source document ID, URL, and position index |
| **paragraphs** | 15,573,145 | One row per paragraph, with sentence count per paragraph |
| **words** | 308,780,287 | Per-shard word frequency and document frequency tables |
| **ngrams** | 8,565,033,040 | Per-shard bigram through 5-gram frequency tables |

Every row traces back to its source document through `doc_id` and `doc_url` fields, making it possible to navigate from any sentence or word back to the original web page. This traceability is important for research that needs to verify context, check for contamination, or build training sets with known provenance.

### Why per-shard frequency tables?

Words and n-grams are computed **per source shard** rather than aggregated into a single global table for each language. This design choice is intentional: some languages in FineWeb-2 contain over 700 million documents, and building a single frequency table for that volume would require holding hundreds of millions of unique entries in memory simultaneously. By keeping frequencies per-shard, each output file stays small and self-contained.

Aggregation is straightforward. A single DuckDB query can combine all shards for a language in seconds:

```sql
-- Language-level word frequencies in one query
SELECT word, sum(frequency) as total_freq, sum(doc_frequency) as total_doc_freq
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/words/lat_Latn/*.parquet'
GROUP BY word
ORDER BY total_freq DESC
LIMIT 100;
```

## What is being released?

Four dataset configs, all stored as Snappy-compressed Parquet files:

### 1. Sentences (`config_name: sentences`)

| Column | Type | Description |
|--------|------|-------------|
| `sentence` | string | The extracted sentence |
| `doc_id` | string | Source document UUID from FineWeb-2 |
| `doc_url` | string | Original web page URL |
| `position` | int32 | 0-based sentence index within the document |
| `length` | int32 | Sentence length in UTF-8 bytes (equal to `LENGTH(sentence)`) |
| `language` | string | ISO 639-3 language code (e.g. `lat`, `vie`, `cmn`) |
| `language_script` | string | ISO 15924 script (e.g. `Latn`, `Hani`, `Cyrl`) |

### 2. Paragraphs (`config_name: paragraphs`)

| Column | Type | Description |
|--------|------|-------------|
| `paragraph` | string | The paragraph text |
| `doc_id` | string | Source document UUID |
| `doc_url` | string | Original web page URL |
| `position` | int32 | 0-based paragraph index within the document |
| `length` | int32 | Paragraph length in UTF-8 bytes (equal to `LENGTH(paragraph)`) |
| `language` | string | ISO 639-3 code |
| `language_script` | string | ISO 15924 script |
| `sentence_count` | int32 | Number of sentences detected in this paragraph |

### 3. Words (`config_name: words`)

| Column | Type | Description |
|--------|------|-------------|
| `word` | string | Lowercased, NFC-normalized word |
| `frequency` | int64 | Occurrence count within this shard |
| `doc_frequency` | int64 | Documents containing this word (within shard) |
| `language` | string | ISO 639-3 code |
| `language_script` | string | ISO 15924 script |

### 4. N-grams (`config_name: ngrams`)

| Column | Type | Description |
|--------|------|-------------|
| `ngram` | string | Space-joined n-gram (e.g. "of the", "in the world") |
| `n` | int32 | N-gram size: 2 (bigram), 3 (trigram), 4, or 5 |
| `frequency` | int64 | Occurrence count within this shard |
| `language` | string | ISO 639-3 code |
| `language_script` | string | ISO 15924 script |

## Data organization

```
open-index/fineweb-2-nlp/
├── README.md
├── stats.csv
└── data/
    ├── sentences/
    │   ├── lat_Latn/
    │   │   └── 0000.parquet
    │   ├── vie_Latn/
    │   │   ├── 0000.parquet
    │   │   └── ...
    │   └── {lang_script}/
    │       └── {shard:04d}.parquet
    ├── paragraphs/
    │   └── {lang_script}/{shard:04d}.parquet
    ├── words/
    │   └── {lang_script}/{shard:04d}.parquet
    └── ngrams/
        └── {lang_script}/{shard:04d}.parquet
```

Each source FineWeb-2 shard maps to exactly **one output file per type per language**. Shard names are zero-padded four-digit integers (`0000`, `0001`, ...) that match the source file ordering from HuggingFace.

## Sentence distribution by language

```
vie_Latn ████████████████████████████████████████ 93,969,314
ars_Arab █████████ 22,484,752
amh_Ethi █████ 13,919,743
epo_Latn █████ 13,638,952
tat_Cyrl ████ 11,464,138
hif_Latn ████ 11,208,693
xho_Latn ████ 11,170,151
ltz_Latn ████ 10,659,797
gmh_Latn ████ 9,691,929
plt_Latn ███ 8,686,135
gla_Latn ███ 8,395,608
jav_Latn ███ 8,390,033
fao_Latn ███ 7,630,863
fry_Latn ███ 7,608,754
yue_Hani ███ 7,079,607
hat_Latn ██ 7,028,779
tuk_Latn ██ 6,621,366
pap_Latn ██ 6,356,844
asm_Beng ██ 6,061,361
ceb_Latn ██ 6,042,623
lao_Laoo ██ 5,920,242
bak_Cyrl ██ 5,855,630
kin_Latn ██ 5,651,597
mri_Latn ██ 5,464,912
mww_Latn ██ 5,194,664
zul_Latn █ 4,676,895
snd_Arab █ 4,406,427
sun_Latn █ 4,266,819
cos_Latn █ 4,008,303
nya_Latn █ 3,987,892
```

<details>
<summary>SQL to reproduce this chart</summary>

```sql
SELECT language || '_' || language_script as lang, count(*) as sentences
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/sentences/**/*.parquet'
GROUP BY lang
ORDER BY sentences DESC
LIMIT 30;
```

</details>

## Paragraph distribution by language

```
vie_Latn ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 2,471,414
ars_Arab ▓▓▓▓▓▓▓▓▓ 611,932
amh_Ethi ▓▓▓▓▓▓▓ 440,312
ltz_Latn ▓▓▓▓▓▓ 376,315
lao_Laoo ▓▓▓▓▓ 365,650
epo_Latn ▓▓▓▓▓ 363,049
fry_Latn ▓▓▓▓▓ 362,914
div_Thaa ▓▓▓▓▓ 349,335
kin_Latn ▓▓▓▓▓ 326,349
yue_Hani ▓▓▓▓▓ 321,379
fao_Latn ▓▓▓▓ 307,557
plt_Latn ▓▓▓▓ 304,804
asm_Beng ▓▓▓▓ 268,257
snd_Arab ▓▓▓▓ 260,735
xho_Latn ▓▓▓▓ 257,612
tuk_Latn ▓▓▓ 243,357
hat_Latn ▓▓▓ 238,512
gla_Latn ▓▓▓ 230,548
ceb_Latn ▓▓▓ 208,880
jav_Latn ▓▓▓ 194,752
```

<details>
<summary>SQL to reproduce this chart</summary>

```sql
SELECT language || '_' || language_script as lang, count(*) as paragraphs
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/paragraphs/**/*.parquet'
GROUP BY lang
ORDER BY paragraphs DESC
LIMIT 20;
```

</details>

## Splitting quality overview

```
ade_Latn ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 386.5 sent/doc
swg_Latn ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 302.5 sent/doc
tuk_Cyrl ░░░░░░░░░░░░░░░░░░░░░░░░░░░ 267.7 sent/doc
crs_Latn ░░░░░░░░░░░░░░░░░░░░░░░░░░░ 265.1 sent/doc
dak_Latn ░░░░░░░░░░░░░░░░░░░░░░░░ 232.8 sent/doc
non_Latn ░░░░░░░░░░░░░░░░░░░░░░░ 226.5 sent/doc
fro_Latn ░░░░░░░░░░░░░░░░░░░░░░ 212.9 sent/doc
pkb_Latn ░░░░░░░░░░░░░░░░░░░░ 197.3 sent/doc
lem_Latn ░░░░░░░░░░░░░░░░░░░ 186.6 sent/doc
san_Latn ░░░░░░░░░░░░░░░░░░ 181.0 sent/doc
wob_Latn ░░░░░░░░░░░░░░░░░ 167.1 sent/doc
guh_Latn ░░░░░░░░░░░░░░░ 154.4 sent/doc
lzh_Hani ░░░░░░░░░░░░░░░ 151.4 sent/doc
rmn_Grek ░░░░░░░░░░░░░░░ 150.7 sent/doc
esk_Latn ░░░░░░░░░░░░░░ 144.9 sent/doc
quh_Latn ░░░░░░░░░░░░░░ 136.2 sent/doc
txu_Latn ░░░░░░░░░░░░ 120.4 sent/doc
byr_Latn ░░░░░░░░░░░░ 116.9 sent/doc
ian_Latn ░░░░░░░░░░░░ 116.6 sent/doc
yss_Latn ░░░░░░░░░░░ 115.2 sent/doc
```

The chart above shows the average number of sentences extracted per source document for each language. This metric serves as a rough proxy for content quality and structural richness. Languages where the average is high tend to contain longer, well-structured articles with clear paragraph and sentence boundaries.
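
As a rough illustration of how a sentences-per-document average of this kind can be derived from rows of the `sentences` config, the sketch below groups rows by language and script and divides the sentence count by the number of distinct `doc_id` values. The sample rows and the helper name `avg_sentences_per_doc` are hypothetical, made up for illustration, and are not part of the dataset's tooling.

```python
from collections import defaultdict


def avg_sentences_per_doc(rows):
    """Return {lang_script: average sentences per distinct document}."""
    sentence_counts = defaultdict(int)  # sentences seen per language-script pair
    doc_ids = defaultdict(set)          # distinct source documents per pair
    for row in rows:
        key = f"{row['language']}_{row['language_script']}"
        sentence_counts[key] += 1
        doc_ids[key].add(row["doc_id"])
    return {
        key: round(sentence_counts[key] / len(doc_ids[key]), 1)
        for key in sentence_counts
    }


# Hypothetical rows shaped like the `sentences` schema (only the fields used here).
sample_rows = [
    {"language": "lat", "language_script": "Latn", "doc_id": "doc-a"},
    {"language": "lat", "language_script": "Latn", "doc_id": "doc-a"},
    {"language": "lat", "language_script": "Latn", "doc_id": "doc-b"},
    {"language": "vie", "language_script": "Latn", "doc_id": "doc-c"},
]

averages = avg_sentences_per_doc(sample_rows)
print(averages)
```

The function takes any iterable of row dictionaries, so it could in principle be fed a streamed iterator, although the sets of document IDs grow with the amount of data scanned.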
Languages with lower averages typically have shorter source documents, or they use scripts and punctuation patterns where automatic sentence boundary detection is more difficult.

## How to download and use this dataset

### 1. DuckDB (recommended for exploration)

DuckDB can query HuggingFace parquet files directly over HTTP without downloading anything to disk. This makes it the fastest way to explore the dataset.

```sql
-- Count sentences per language
SELECT language, language_script, count(*) as sentences
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/sentences/**/*.parquet'
GROUP BY ALL
ORDER BY sentences DESC;

-- Read Latin sentences
SELECT sentence, doc_url
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/sentences/lat_Latn/*.parquet'
LIMIT 20;

-- Top 100 most frequent words in a language
SELECT word, frequency, doc_frequency
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/words/vie_Latn/*.parquet'
ORDER BY frequency DESC
LIMIT 100;

-- Most common bigrams in Latin
SELECT ngram, frequency
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/ngrams/lat_Latn/*.parquet'
WHERE n = 2
ORDER BY frequency DESC
LIMIT 50;

-- Average sentences per document per language
-- (group by language AND script; grouping by script alone would merge languages)
SELECT language, language_script,
       count(DISTINCT doc_id) as docs,
       count(*) as sentences,
       round(count(*) * 1.0 / count(DISTINCT doc_id), 1) as avg_sent_per_doc
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/sentences/**/*.parquet'
GROUP BY language, language_script
ORDER BY sentences DESC
LIMIT 20;

-- Aggregate word frequencies across all shards
SELECT word, sum(frequency) as total_freq
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/words/lat_Latn/*.parquet'
GROUP BY word
ORDER BY total_freq DESC
LIMIT 50;

-- Find sentences containing a specific word
SELECT sentence, doc_url
FROM 'hf://datasets/open-index/fineweb-2-nlp/data/sentences/lat_Latn/*.parquet'
WHERE sentence ILIKE '%roma%'
LIMIT 20;
```

### 2. Python (datasets library)

```python
from datasets import load_dataset

# Stream all sentences (no full download needed)
ds = load_dataset("open-index/fineweb-2-nlp", "sentences", split="train", streaming=True)
for row in ds.take(10):
    print(f"[{row['language']}] {row['sentence'][:100]}")

# Load paragraphs for a specific language (filtered lazily while streaming)
ds = load_dataset("open-index/fineweb-2-nlp", "paragraphs", split="train", streaming=True)
lat_paras = (row for row in ds if row["language"] == "lat")

# Word frequencies
ds = load_dataset("open-index/fineweb-2-nlp", "words", split="train", streaming=True)
for row in ds.take(20):
    print(f"{row['word']:20s} freq={row['frequency']:>8,} doc_freq={row['doc_frequency']:>6,}")

# N-gram analysis
ds = load_dataset("open-index/fineweb-2-nlp", "ngrams", split="train", streaming=True)
bigrams = (row for row in ds if row["n"] == 2)
```

### 3. huggingface_hub CLI

```bash
# Download all Latin sentences
huggingface-cli download open-index/fineweb-2-nlp --include "data/sentences/lat_Latn/*" --repo-type dataset

# Download Vietnamese words and ngrams
huggingface-cli download open-index/fineweb-2-nlp --include "data/words/vie_Latn/*" "data/ngrams/vie_Latn/*" --repo-type dataset

# Download everything for one language
huggingface-cli download open-index/fineweb-2-nlp --include "data/*/lat_Latn/*" --repo-type dataset
```

### 4. pandas + DuckDB

```python
import duckdb

conn = duckdb.connect()

# Latin sentences as DataFrame
df = conn.sql("""
    SELECT sentence, doc_url, position
    FROM 'hf://datasets/open-index/fineweb-2-nlp/data/sentences/lat_Latn/*.parquet'
    LIMIT 1000
""").df()
print(f"Loaded {len(df):,} sentences")
print(df.head(10))

# Word frequency analysis
words_df = conn.sql("""
    SELECT word, sum(frequency) as total_freq
    FROM 'hf://datasets/open-index/fineweb-2-nlp/data/words/lat_Latn/*.parquet'
    GROUP BY word
    ORDER BY total_freq DESC
    LIMIT 200
""").df()
print(words_df)
```

## Dataset statistics

| Metric | Value |
|--------|-------|
| **Total sentences** | **494,036,544** |
| **Total paragraphs** | **15,573,145** |
| **Total word tokens** | **3,437,724,582** |
| **Unique word entries** (per-shard) | 308,780,287 |
| **Total n-gram entries** (per-shard) | 8,565,033,040 |
| **Languages processed** | **1231** |
| **Source documents** | **14,400,889** |
| **Source data processed** | **23.5 GB** |
| **Output parquet size** | **181.8 GB** |
| Avg sentence length | 139.4 chars |
| Avg paragraph length | 4452.1 chars |
| Avg sentences per document | 34.3 |
| Avg paragraphs per document | 1.1 |
| Avg sentences per paragraph | 31.7 |

### Per-language breakdown

| # | Language | Sentences | Paragraphs | Words | Avg Sent (chars) | Avg Para (chars) | Docs | Shards | Source | Output |
|---|----------|-----------|------------|-------|------------------|------------------|------|--------|--------|--------|
| 1 | Vietnamese (`vie_Latn`) | 93,969,314 | 2,471,414 | 0 | 134.8 | 5162.5 | 2,319,000 | 1 | 4.5 GB | 11.6 GB |
| 2 | ars_Arab (`ars_Arab`) | 22,484,752 | 611,932 | 240,374,070 | 104.7 | 3881.4 | 298,167 | 1 | 777.7 MB | 17.3 GB |
| 3 | Amharic (`amh_Ethi`) | 13,919,743 | 440,312 | 0 | 191.2 | 6075.4 | 428,373 | 1 | 848.5 MB | 2.5 GB |
| 4 | Esperanto (`epo_Latn`) | 13,638,952 | 363,049 | 202,969,272 | 94.6 | 3590.5 | 335,993 | 1 | 568.4 MB | 12.0 GB |
| 5 | Tatar (`tat_Cyrl`) | 11,464,138 | 168,780 | 0 | 142.0 | 9710.1 | 161,354 | 1 | 489.4 MB | 1.3 GB |
| 6 | hif_Latn (`hif_Latn`) | 11,208,693 | 140,856 | 0 | 96.3 | 7744.6 | 132,560 | 1 | 431.6 MB | 1.2 GB |
| 7 | Xhosa (`xho_Latn`) | 11,170,151 | 257,612 | 94,819,557 | 70.4 | 3093.6 | 254,164 | 1 | 275.6 MB | 5.7 GB |
| 8 | Luxembourgish (`ltz_Latn`) | 10,659,797 | 376,315 | 0 | 97.6 | 2793.0 | 354,553 | 1 | 468.0 MB | 1.2 GB |
| 9 | gmh_Latn (`gmh_Latn`) | 9,691,929 | 86,529 | 0 | 376.9 | 42324.3 | 84,495 | 1 | 1.3 GB | 3.4 GB |
| 10 | plt_Latn (`plt_Latn`) | 8,686,135 | 304,804 | 136,924,986 | 110.0 | 3161.9 | 272,871 | 1 | 365.8 MB | 7.6 GB |
| 11 | Scottish Gaelic (`gla_Latn`) | 8,395,608 | 230,548 | 149,364,064 | 110.4 | 4055.4 | 222,468 | 1 | 348.4 MB | 7.3 GB |
| 12 | Javanese (`jav_Latn`) | 8,390,033 | 194,752 | 117,949,223 | 96.9 | 4217.1 | 184,561 | 1 | 316.0 MB | 6.7 GB |
| 13 | Faroese (`fao_Latn`) | 7,630,863 | 307,557 | 88,300,461 | 76.5 | 1922.8 | 291,151 | 1 | 270.2 MB | 5.3 GB |
| 14 | Western Frisian (`fry_Latn`) | 7,608,754 | 362,914 | 0 | 90.2 | 1911.5 | 349,743 | 1 | 316.8 MB | 845.4 MB |
| 15 | yue_Hani (`yue_Hani`) | 7,079,607 | 321,379 | 196,289,292 | 94.6 | 2096.3 | 314,951 | 1 | 430.5 MB | 8.0 GB |
| 16 | Haitian Creole (`hat_Latn`) | 7,028,779 | 238,512 | 120,217,049 | 92.5 | 2754.8 | 224,472 | 1 | 283.2 MB | 5.8 GB |
| 17 | Turkmen (`tuk_Latn`) | 6,621,366 | 243,357 | 82,201,211 | 110.4 | 3028.9 | 238,155 | 1 | 295.2 MB | 6.1 GB |
| 18 | Papiamento (`pap_Latn`) | 6,356,844 | 190,770 | 97,107,091 | 82.9 | 2794.3 | 181,759 | 1 | 230.2 MB | 4.8 GB |
| 19 | Assamese (`asm_Beng`) | 6,061,361 | 268,257 | 206,708,478 | 244.9 | 5553.2 | 267,371 | 1 | 367.5 MB | 6.1 GB |
| 20 | Cebuano (`ceb_Latn`) | 6,042,623 | 208,880 | 114,780,722 | 113.6 | 3313.0 | 204,636 | 1 | 265.8 MB | 5.5 GB |
| 21 | Lao (`lao_Laoo`) | 5,920,242 | 365,650 | 0 | 392.2 | 6365.7 | 359,623 | 1 | 563.8 MB | 1.5 GB |
| 22 | Bashkir (`bak_Cyrl`) | 5,855,630 | 188,955 | 66,144,210 | 150.2 | 4684.6 | 183,068 | 1 | 276.5 MB | 5.7 GB |
| 23 | Kinyarwanda (`kin_Latn`) | 5,651,597 | 326,349 | 114,611,731 | 146.2 | 2548.9 | 326,120 | 1 | 351.1 MB | 7.2 GB |
| 24 | Maori (`mri_Latn`) | 5,464,912 | 185,306 | 108,134,455 | 97.8 | 2912.9 | 166,938 | 1 | 201.8 MB | 4.1 GB |
| 25 | mww_Latn (`mww_Latn`) | 5,194,664 | 165,389 | 0 | 113.9 | 3607.8 | 158,808 | 1 | 209.0 MB | 551.9 MB |
| 26 | Zulu (`zul_Latn`) | 4,676,895 | 130,444 | 0 | 108.3 | 3917.6 | 127,335 | 1 | 208.6 MB | 562.3 MB |
| 27 | Sindhi (`snd_Arab`) | 4,406,427 | 260,735 | 145,693,284 | 288.4 | 4889.4 | 257,843 | 1 | 395.0 MB | 8.3 GB |
| 28 | Sundanese (`sun_Latn`) | 4,266,819 | 109,653 | 0 | 104.2 | 4090.7 | 106,542 | 1 | 177.7 MB | 474.9 MB |
| 29 | Corsican (`cos_Latn`) | 4,008,303 | 116,824 | 0 | 106.7 | 3694.7 | 111,036 | 1 | 176.5 MB | 468.5 MB |
| 30 | Chichewa (`nya_Latn`) | 3,987,892 | 109,097 | 0 | 103.8 | 3830.7 | 103,045 | 1 | 159.6 MB | 426.6 MB |
| 31 | nap_Latn (`nap_Latn`) | 3,771,811 | 75,691 | 0 | 43.0 | 2189.2 | 45,778 | 1 | 77.1 MB | 216.2 MB |
| 32 | Samoan (`smo_Latn`) | 3,700,497 | 115,927 | 0 | 114.2 | 3675.3 | 110,674 | 1 | 152.9 MB | 413.7 MB |
| 33 | Southern Sotho (`sot_Latn`) | 3,560,092 | 96,267 | 69,832,863 | 107.4 | 4009.3 | 92,492 | 1 | 141.4 MB | 437.6 MB |
| 34 | Igbo (`ibo_Latn`) | 3,497,263 | 103,007 | 0 | 112.4 | 3848.4 | 98,785 | 1 | 143.1 MB | 376.5 MB |
| 35 | Shona (`sna_Latn`) | 3,388,739 | 88,313 | 0 | 105.4 | 4080.2 | 84,381 | 1 | 143.3 MB | 386.3 MB |
| 36 | sah_Cyrl (`sah_Cyrl`) | 3,278,551 | 76,504 | 0 | 155.5 | 6704.6 | 72,847 | 1 | 153.7 MB | 430.7 MB |
| 37 | Hindi (`hin_Latn`) | 3,242,116 | 109,456 | 0 | 158.8 | 4731.1 | 97,024 | 1 | 189.3 MB | 531.5 MB |
| 38 | Ossetian (`oss_Cyrl`) | 3,236,069 | 76,954 | 0 | 88.2 | 3751.9 | 63,690 | 1 | 83.4 MB | 243.9 MB |
| 39 | Chuvash (`chv_Cyrl`) | 3,166,954 | 89,261 | 0 | 131.5 | 4701.1 | 81,380 | 1 | 132.7 MB | 356.8 MB |
| 40 | Divehi (`div_Thaa`) | 3,107,250 | 349,335 | 0 | 480.3 | 4280.4 | 348,727 | 1 | 361.3 MB | 1016.2 MB |
| 41 | Uyghur (`uig_Arab`) | 3,002,400 | 168,805 | 76,906,800 | 371.0 | 6614.7 | 165,637 | 1 | 314.2 MB | 7.0 GB |
| 42 | haw_Latn (`haw_Latn`) | 2,873,005 | 98,993 | 0 | 120.7 | 3529.8 | 95,507 | 1 | 121.6 MB | 323.9 MB |
| 43 | ydd_Hebr (`ydd_Hebr`) | 2,819,656 | 140,325 | 0 | 331.2 | 6674.2 | 135,116 | 1 | 259.3 MB | 704.4 MB |
| 44 | sme_Latn (`sme_Latn`) | 2,692,543 | 82,649 | 0 | 68.3 | 2256.3 | 65,661 | 1 | 79.1 MB | 213.5 MB |
| 45 | Yoruba (`yor_Latn`) | 2,552,792 | 80,759 | 0 | 119.2 | 3798.0 | 79,999 | 1 | 116.7 MB | 303.2 MB |
| 46 | Low German (`nds_Latn`) | 2,512,337 | 85,151 | 0 | 72.2 | 2159.3 | 64,394 | 1 | 84.8 MB | 223.8 MB |
| 47 | Sanskrit (`san_Deva`) | 2,450,273 | 23,647 | 0 | 142.2 | 14834.7 | 21,453 | 1 | 83.9 MB | 256.3 MB |
| 48 | gsw_Latn (`gsw_Latn`) | 2,303,529 | 75,950 | 0 | 73.6 | 2260.9 | 58,314 | 1 | 86.8 MB | 227.7 MB |
| 49 | Tibetan (`bod_Tibt`) | 2,299,148 | 162,441 | 0 | 898.3 | 12728.1 | 161,076 | 1 | 400.1 MB | 1.1 GB |
| 50 | hyw_Armn (`hyw_Armn`) | 2,253,914 | 153,121 | 0 | 381.9 | 5634.6 | 151,767 | 1 | 252.3 MB | 678.6 MB |
| 51 | Urdu (`urd_Latn`) | 2,166,912 | 71,898 | 0 | 132.0 | 4006.6 | 69,321 | 1 | 122.9 MB | 333.8 MB |
| 52 | Asturian (`ast_Latn`) | 2,152,203 | 81,267 | 0 | 126.7 | 3381.9 | 71,329 | 1 | 118.9 MB | 314.1 MB |
| 53 | Occitan (`oci_Latn`) | 1,920,830 | 75,494 | 0 | 101.7 | 2611.6 | 69,376 | 1 | 88.9 MB | 234.2 MB |
| 54 | lus_Latn (`lus_Latn`) | 1,894,138 | 91,313 | 43,594,785 | 121.6 | 2542.8 | 90,564 | 1 | 97.3 MB | 299.9 MB |
| 55 | azb_Arab (`azb_Arab`) | 1,861,323 | 104,443 | 0 | 186.8 | 3345.2 | 79,211 | 1 | 111.9 MB | 296.6 MB |
| 56 | apc_Arab (`apc_Arab`) | 1,726,654 | 71,704 | 0 | 99.6 | 2422.4 | 69,740 | 1 | 62.2 MB | 171.2 MB |
| 57 | hbo_Hebr (`hbo_Hebr`) | 1,716,258 | 47,260 | 0 | 209.3 | 7636.0 | 39,619 | 1 | 114.2 MB | 317.6 MB |
| 58 | rue_Cyrl (`rue_Cyrl`) | 1,691,247 | 42,722 | 0 | 121.6 | 4851.4 | 39,923 | 1 | 69.6 MB | 200.4 MB |
| 59 | Bavarian (`bar_Latn`) | 1,632,820 | 49,664 | 0 | 69.6 | 2321.2 | 37,025 | 1 | 56.1 MB | 153.2 MB |
| 60 | anp_Deva (`anp_Deva`) | 1,628,437 | 59,925 | 0 | 201.3 | 5494.7 | 57,997 | 1 | 80.3 MB | 225.1 MB |

## How it works

The pipeline is a single Go binary that walks every language-script partition FineWeb-2 publishes, splits the documents inside, and commits the results back to HuggingFace one shard at a time. The scale is what makes the design interesting: FineWeb-2 is 20 TB of text spread across 1,868 language-script partitions, with individual languages ranging from a single 10 MB shard up to over a hundred multi-gigabyte shards. Any stage that tries to hold more than one shard's worth of data in memory or on disk will eventually exhaust the machine.

The core design choice is *process one shard end-to-end, persist nothing worth losing*. A shard is small enough to decompress into working memory, large enough to amortize the fixed cost of a HuggingFace commit, and self-contained enough that a crash mid-flight costs minutes rather than hours. Every other decision in the pipeline (the sequential download strategy, the lack of an external database, the refusal to batch commits across languages) flows from that principle.

### The stages

**Download.** Source shards are pulled sequentially from HuggingFace over plain HTTP. We do not fan out parallel downloads: the split stage keeps the CPU saturated on its own, and parallel downloads would only invite rate limits without meaningfully shortening wall-clock time. Downloads are idempotent by file size, so a restart silently skips shards that are already fully on disk and re-pulls whatever was cut off mid-transfer.

**Read.** Shards are streamed row-by-row via `parquet-go`, in batches of 10,000 rows when the words and n-grams configs are enabled and up to 50,000 rows when only sentences and paragraphs are being extracted.
The batch size is not arbitrary: per-worker frequency maps scale roughly linearly with the batch size, and for Indo-European languages 10K rows × 6 workers × hundreds of tokens per document × four n-gram sizes is already enough data to stress a naive implementation. Reads are pipelined (the next batch is prefetched while the current one is being split), so there is no I/O stall between batches.

**Split.** Each batch is sharded across worker goroutines (one per CPU) that run the language-aware segmentation logic described in the next section. This is where most of the multilingual complexity lives: sentence rules shift depending on the writing system of the document, word extraction runs under NFC normalization regardless of script, and CJK characters are individually tokenized because space-delimited word boundaries do not exist in Chinese, Japanese, or Korean. Workers keep thread-local frequency maps to avoid lock contention, and the maps are merged into per-shard totals only at batch boundaries.

Frequency maps are pruned when they cross one million unique entries: rows with a count of one are evicted first, and lower-frequency rows follow if that is not enough. For low-resource languages this is almost never triggered: an entire small language may have only a few hundred thousand unique words across every shard combined. For high-volume languages, pruning keeps memory bounded without meaningfully distorting the distribution, because Zipf's law ensures that the discarded tail is dominated by typos, OCR artifacts, and hapax legomena that carry little analytical signal.

**Write.** Sentences, paragraphs, words, and n-grams are written to four separate Snappy-compressed parquet files with 50,000 rows per row group. Snappy compresses text to roughly half its raw size and decompresses fast enough that DuckDB can scan the dataset at full HTTP bandwidth without the CPU becoming the bottleneck.
We deliberately chose Snappy over Zstandard after benchmarking both: Zstandard produced noticeably smaller files but was significantly slower on the read path, and read throughput is what matters for a dataset meant to be queried over `hf://` URLs. Row groups of 50,000 rows keep metadata overhead low while remaining small enough for DuckDB's predicate pushdown to skip irrelevant groups when users filter by `language`, `language_script`, or `doc_url`.

**Publish.** The four output files, a refreshed `stats.csv`, and a newly rendered `README.md` are committed to HuggingFace as a single LFS-aware commit. Either every file in the commit lands or none of them do, so a partial upload never leaves the dataset in a half-written state.

HuggingFace rate limits are treated as first-class operational events. A 429 response honors the `Retry-After` header when present and falls back to a two-minute wait when it is not; other transient errors are retried with a linear backoff (30, 60, 90, 120, 150 seconds) up to five attempts. Beyond that, the shard is skipped for this run and will be retried on the next pipeline invocation, a consequence of keeping `stats.csv` as the only state of record.

**Clean up.** After a successful publish, the source shard and the four output files are deleted. This is what lets the pipeline run indefinitely on a VM with 40–80 GB of free disk while processing tens of terabytes over the course of days. It also means `stats.csv` is the only signal that a shard has been completed: an absent output file is indistinguishable from one that never existed, and the stats file carries the full history.

### Resumability and state

The pipeline keeps exactly one piece of durable state: `stats.csv`, which records every completed (language-script, shard) pair along with its counts and byte totals. On startup it reads the file, diffs the finished set against the list of source shards that still exist on HuggingFace, and starts working on the remainder.
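
The startup diff can be sketched in a few lines. The pipeline itself is a Go binary and the actual `stats.csv` column names are not shown here, so the Python below is only an illustration: the `lang_script` and `shard` columns and the `pending_shards` helper are assumptions, not the real schema or API.

```python
import csv
import io


def pending_shards(stats_csv_text, source_shards):
    """Return upstream (lang_script, shard) pairs with no row in the ledger.

    The column names `lang_script` and `shard` are hypothetical; the real
    stats.csv layout may differ.
    """
    reader = csv.DictReader(io.StringIO(stats_csv_text))
    done = {(row["lang_script"], int(row["shard"])) for row in reader}
    # Preserve upstream order so shards are still processed sequentially.
    return [pair for pair in source_shards if pair not in done]


# Hand-written ledger and upstream listing, for illustration only.
ledger = "lang_script,shard,sentences\nlat_Latn,0,1200\nvie_Latn,0,5000\n"
upstream = [("lat_Latn", 0), ("vie_Latn", 0), ("vie_Latn", 1)]
remaining = pending_shards(ledger, upstream)
print(remaining)
```

Because the finished set is rebuilt from the file on every start, a shard that failed to publish simply never appears in it and is picked up again on the next run.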
There is no database, no queue, no lock file, and no distributed coordination, just a flat CSV that happens to also be human-readable and checked into the published dataset.

An earlier iteration used DuckDB for state tracking, which worked but added operational overhead: backups, schema migrations, the occasional recovery from a partially written database file. Falling back to CSV removed an entire category of failures and costs almost nothing in performance. Even with tens of thousands of rows, parsing the file at startup takes well under a second, and append-only writes are atomic at the OS level for small buffers.

The same `stats.csv` is committed to the HuggingFace repo on every shard publish, which means the dataset itself is its own ledger. A fresh machine with no local state can clone the repo, read the CSV, and pick up exactly where the last machine left off.

### Resource budgets

The pipeline runs comfortably inside these ceilings on a 4-core VM with 8 GB of RAM:

| Resource | Budget | How |
|----------|--------|-----|
| **Memory** | ~200 MB resident | 10K-row parquet batches, frequency maps pruned at 1M entries |
| **Disk** | ~10 GB peak | One shard in flight, deleted after successful publish |
| **Network** | Sequential | One download and one commit at a time; backoff on 429 and 5xx |

These budgets are intentionally conservative. When the pipeline falls over, it is almost always because of something external (a HuggingFace Hub incident, a transient DNS failure, an OOM from some other process on the same VM), and the design means those failures cost minutes of lost work rather than hours.

## Splitting methodology

### Sentence splitting

Sentence segmentation is one of the harder problems in multilingual NLP. There is no universal rule for where sentences begin and end: different languages use different punctuation conventions, and web text frequently breaks the conventions of any language.
Our approach uses a set of punctuation and casing heuristics tuned for web text across many scripts. The rules are designed to be conservative, preferring to keep text together rather than over-splitting. For short texts (under 500 characters), we use [sentencex](https://github.com/wikimedia/sentencex), a Wikimedia project that provides language-specific sentence boundary detection with knowledge of each language's abbreviation patterns and punctuation norms.

| Rule | Example | Behavior |
|------|---------|----------|
| Period + space + uppercase | `world. The` | Split |
| Abbreviation + period | `Mr. Smith` | No split |
| Decimal number | `3.14 is` | No split |
| Single-letter initial | `J. K. Rowling` | No split |
| CJK fullstop | `世界。今天` | Always split |
| Devanagari danda | `text। next` | Always split |
| Exclamation/question | `really! What` | Split |
| Newline after 10+ chars | `long text\nNext` | Split |

For CJK languages (Chinese, Japanese, Korean), individual Han characters, Hiragana, Katakana, and Hangul syllables are each treated as separate word tokens, reflecting the character-level structure of these writing systems. This means that a Chinese sentence like "今天天气很好" produces six word tokens rather than being treated as a single unsplittable string.

### Word splitting

Word extraction follows a straightforward pipeline designed to produce clean, normalized tokens suitable for frequency analysis:

1. NFC normalization (Unicode canonical composition) to ensure that equivalent character sequences are represented identically
2. Lowercase conversion for case-insensitive frequency counting
3. Splitting on non-letter, non-digit boundaries, while preserving apostrophes and hyphens that appear mid-word (e.g. "don't", "well-known")
4. Stripping of leading and trailing punctuation
5. Filtering of empty strings and pure-punctuation tokens

### Paragraph splitting

FineWeb-2's source text comes from HTML pages processed by trafilatura, a web content extraction library.
In trafilatura's output, HTML `<p>` tags are represented as double newlines (`\n\n`). We use this convention to split text into paragraphs:

1. Split on sequences of two or more consecutive newlines
2. Trim leading and trailing whitespace from each paragraph
3. Discard fragments shorter than 20 characters, which typically correspond to navigation elements, single-word headers, or other structural debris from the original HTML

This simple approach works well in practice because trafilatura has already done the hard work of extracting meaningful content blocks from the HTML.

### N-gram extraction

N-grams are extracted by sliding a window of size *n* over the word token sequence for each document. We compute bigrams (n=2), trigrams (n=3), 4-grams, and 5-grams.

| N | Name | Example from "the quick brown fox" |
|---|------|-------------------------------------|
| 2 | Bigram | "the quick", "quick brown", "brown fox" |
| 3 | Trigram | "the quick brown", "quick brown fox" |
| 4 | 4-gram | "the quick brown fox" |
| 5 | 5-gram | *(needs 5+ words)* |

To keep memory usage bounded, per-shard frequency maps are pruned when they exceed 1 million unique entries. During pruning, entries with a frequency of 1 are evicted first. This means that very rare n-grams in large shards may be undercounted, but the most frequent and analytically useful n-grams are preserved accurately.

## Dataset card

### Dataset summary

FineWeb-2 NLP provides pre-segmented versions of HuggingFace's [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) dataset. Each of the approximately 5 billion source documents has been split into sentences, paragraphs, words, and n-grams using language-aware processing. The four resulting datasets share document IDs, so researchers can cross-reference between them: look up which sentences appear in a document, check the word frequencies for that language, or find which n-grams co-occur with a particular sentence.
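
As a small illustration of cross-referencing by document ID, the sketch below regroups sentence rows into their source documents using the `doc_id` and `position` fields of the sentence records. The helper name `group_sentences_by_doc` and the sample `doc_id` values are hypothetical; in practice the rows would come from a streamed `sentences` config or a DuckDB query rather than a hand-written list.

```python
from collections import defaultdict


def group_sentences_by_doc(rows):
    """Group sentence rows by doc_id and order each group by position."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["doc_id"]].append(row)
    return {
        doc_id: [r["sentence"] for r in sorted(doc_rows, key=lambda r: r["position"])]
        for doc_id, doc_rows in grouped.items()
    }


# Hand-written sample rows; the doc_id values are placeholders, not real IDs.
sample_rows = [
    {"doc_id": "doc-a", "position": 1, "sentence": "Hi omnes lingua inter se differunt."},
    {"doc_id": "doc-b", "position": 0, "sentence": "Veni, vidi, vici."},
    {"doc_id": "doc-a", "position": 0, "sentence": "Gallia est omnis divisa in partes tres."},
]

docs = group_sentences_by_doc(sample_rows)
print(" ".join(docs["doc-a"]))
```

Sorting on `position` restores the original sentence order even when rows arrive out of order, as they may when reading several parquet files.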
The primary goal is to lower the barrier to multilingual NLP research. Instead of downloading and processing 20 TB of raw text, researchers can query exactly the slice they need, whether that is all sentences in Latin, word frequencies in Vietnamese, or bigram distributions across every language in the corpus.

### Data instances

**Sentence:**

```json
{
  "sentence": "Gallia est omnis divisa in partes tres.",
  "doc_id": "f7ef49fc-6899-4d56-aaa7-bea5924802f3",
  "doc_url": "https://example.com/caesar",
  "position": 0,
  "language": "lat",
  "language_script": "Latn"
}
```

**Word:**

```json
{
  "word": "est",
  "frequency": 847,
  "doc_frequency": 412,
  "language": "lat",
  "language_script": "Latn"
}
```

**N-gram:**

```json
{
  "ngram": "in partes",
  "n": 2,
  "frequency": 23,
  "language": "lat",
  "language_script": "Latn"
}
```

### Curation rationale

Sentence-level and word-level datasets are foundational for many areas of NLP research. They are used to train sentence embeddings, build and evaluate language models, study word frequency distributions and Zipf's law across languages, analyze collocations and phrasal patterns, and benchmark multilingual NLP tools. Having these units pre-extracted and ready to query saves researchers significant time and computational resources, and makes it practical to work with languages that might otherwise be overlooked due to the effort required to process the raw data.

### Source data

All text originates from [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) (DOI: [10.57967/hf/3744](https://doi.org/10.57967/hf/3744)). FineWeb-2 was constructed by extracting text from approximately 100 Common Crawl snapshots covering 2013 through 2024. The extraction pipeline includes text extraction via trafilatura, language identification using GlotLID, MinHash deduplication to remove near-duplicate documents, and adaptive quality filtering to remove low-quality content.
We do not apply any additional filtering or deduplication beyond what FineWeb-2 provides.

### Considerations for using the data

There are several important limitations to keep in mind when working with this dataset:

**Low-resource language coverage.** Many of the smaller languages in FineWeb-2 consist primarily of Bible translations, Wikipedia mirrors, and religious texts. The FineWeb-2 authors note that over 70% of language-script pairs have more than 50% of their content from such sources. Word frequencies and n-gram distributions for these languages will reflect this narrow domain rather than general language use.

**Sentence splitting accuracy.** The quality of sentence segmentation varies by language and script. Latin-script and CJK languages tend to produce the most accurate results, because their punctuation conventions are well-understood and widely standardized. Languages with less common scripts, or languages that use minimal punctuation, may have lower splitting accuracy.

**Vietnamese word boundaries.** Vietnamese is written with spaces between syllables rather than between words. As a result, compound words like "học sinh" (student) are split into their component syllables "học" and "sinh" rather than being kept as a single token. This is a known limitation of whitespace-based word splitting for Vietnamese.

**Per-shard word frequencies.** Word and n-gram frequencies are computed per source shard, not aggregated globally. To get language-level frequencies, aggregate with `sum(frequency) GROUP BY word` in DuckDB or any query engine that can read Parquet.

**No additional PII filtering.** This dataset does not apply any personally identifiable information filtering beyond what was already done upstream by the FineWeb-2 team. Web text inherently contains names, email addresses, and other personal information.

### License

[ODC-By 1.0](https://opendatacommons.org/licenses/by/1-0/) (Open Data Commons Attribution License), following FineWeb-2's license.
### Author

Created by **Duc-Tam Nguyen** ([tamnd](https://huggingface.co/tamnd)) as part of the [open-index](https://huggingface.co/open-index) project.

### Citation

```bibtex
@misc{fineweb2nlp2026,
  title  = {FineWeb-2 NLP: Sentences, Paragraphs, Words, and N-grams},
  author = {Nguyen, Duc-Tam},
  year   = {2026},
  url    = {https://huggingface.co/datasets/open-index/fineweb-2-nlp},
  note   = {Derived from FineWeb-2 (HuggingFaceFW/fineweb-2)}
}

@article{penedo2025fineweb2,
  title         = {FineWeb2: One Pipeline to Scale Them All},
  author        = {Guilherme Penedo and others},
  year          = {2025},
  eprint        = {2506.20920},
  archivePrefix = {arXiv}
}
```

---

*Last updated: 2026-04-16 11:30 UTC*
Provider: open-index