bowang0911/fineweb-2-autocurate

Source: Hugging Face. Updated 2026-03-24, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/bowang0911/fineweb-2-autocurate
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- da
- sv
- "no"
- fi
- nl
- pl
- tr
- vi
- cs
tags:
- fineweb
- autocurate
- data-quality
- multilingual
pretty_name: fineweb-2-autocurate
configs:
- config_name: dan_Latn
  data_files:
  - split: train
    path: data/dan_Latn/train/*
- config_name: swe_Latn
  data_files:
  - split: train
    path: swe_Latn/train/*
- config_name: nob_Latn
  data_files:
  - split: train
    path: nob_Latn/train/*
- config_name: fin_Latn
  data_files:
  - split: train
    path: fin_Latn/train/*
- config_name: nld_Latn
  data_files:
  - split: train
    path: nld_Latn/train/*
- config_name: pol_Latn
  data_files:
  - split: train
    path: pol_Latn/train/*
- config_name: ces_Latn
  data_files:
  - split: train
    path: ces_Latn/train/*
- config_name: tur_Latn
  data_files:
  - split: train
    path: tur_Latn/train/*
- config_name: vie_Latn
  data_files:
  - split: train
    path: vie_Latn/train/*
---

# fineweb-2-autocurate

Autonomously curated subsets of [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).

An LLM agent iteratively samples documents, identifies quality problems, and proposes heuristic fixes. Each fix is validated by training a small language model for 5 minutes and measuring the change in bits per byte (BPB) on a Wikipedia eval set. Only fixes that lower BPB are kept.

Built with [autocurate](https://github.com/bwang-pplx/autocurate).

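The card does not say how BPB is computed, so the following is only a minimal sketch under the usual definition: the summed negative log-likelihood over the eval text, converted from nats to bits and divided by the number of UTF-8 bytes. The function names `bits_per_byte` and `relative_change` are illustrative and are not taken from autocurate.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Assumption: this is the standard definition, not necessarily the exact
    metric implemented in autocurate.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def relative_change(before: float, after: float) -> float:
    """Relative BPB change; a negative value means BPB went down."""
    return (after - before) / before


# Danish BPB values as published on this card (1.448 before, 1.422 after).
print(f"{relative_change(1.448, 1.422):.1%}")  # -1.8%
```

Because BPB is normalized by bytes rather than tokens, it stays comparable between models that use different tokenizers.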
## Subsets

| Language | Subset | Original Docs | Kept Docs | Kept % | BPB Before → After | Improvement |
|---|---|---|---|---|---|---|
| Danish | `dan_Latn` | 45.4M | 39.5M | 87% | 1.448 → 1.422 | -1.8% |
| Swedish | `swe_Latn` | 59.5M | 53.9M | 91% | 1.443 → 1.437 | -0.4% |
| Norwegian Bokmål | `nob_Latn` | 38.1M | 30.6M | 80% | 1.447 → 1.439 | -0.5% |
| Finnish | `fin_Latn` | 36.7M | 33.0M | 90% | 1.386 → 1.374 | -0.8% |
| Dutch | `nld_Latn` | 147.3M | 126.0M | 86% | 1.405 → 1.391 | -1.0% |
| Polish | `pol_Latn` | 152.0M | 100.3M | 66% | 1.196 → 1.176 | -1.7% |
| Czech | `ces_Latn` | 66.1M | 32.5M | 49% | 1.167 → 1.153 | -1.1% |
| Turkish | `tur_Latn` | 95.1M | 83.9M | 88% | 1.117 → 1.111 | -0.6% |
| Vietnamese | `vie_Latn` | 61.1M | 61.0M | 99.9% | 0.863 → 0.808 | -6.4% |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("bowang0911/fineweb-2-autocurate", "dan_Latn", split="train")
```

## Filters applied per language

<details>
<summary>Vietnamese (vie_Latn): 1 filter</summary>

- **Drop gambling/casino spam**: keyword density filter for betting and casino spam injected into articles

</details>

<details>
<summary>Turkish (tur_Latn): 2 cleaners, 3 filters</summary>

- Strip aggregated unrelated content (news tickers, sidebars)
- Strip truncated line endings
- Drop incomplete/truncated text ending mid-sentence (2 rules)
- Drop SEO spam keyword stuffing

</details>

<details>
<summary>Czech (ces_Latn): 2 cleaners, 4 filters</summary>

- Strip truncated line endings (`...splňuje`, `...Pokračovat ve čtení`)
- Strip boilerplate footers (`Můžete také zajímat`, `Zaregistrovat se`, `více »`)
- Drop truncated articles containing `...`
- Drop e-commerce pages (`Skladem`, `PPL`, `Koupit`, `Poštovné`)
- Drop cookie/boilerplate pages (`využíváme soubory cookies`, `Copyright`)
- Drop incomplete text (`číst dále`, `zobrazit více`)

</details>

<details>
<summary>Polish (pol_Latn): 2 cleaners, 3 filters</summary>

- Strip footer boilerplate (`Ostatnie wiadomości`, `Newsletter`, `Zaloguj się`)
- Strip cookie/UI noise (`Ta strona używa plików cookies`)
- Drop truncated text (`...`, `czytaj dalej`, `zobacz więcej`)
- Drop truncated text with abrupt cutoffs
- Drop cookie banner documents

</details>

<details>
<summary>Finnish (fin_Latn): 1 cleaner, 5 filters</summary>

- Strip link farm / footer leakage
- Drop CLI flags and ellipsis patterns
- Drop machine translation errors and gibberish
- Drop OCR errors and garbled text
- Drop grammar/syntax errors and morphological breakdowns
- Drop SEO spam keyword stuffing

</details>

<details>
<summary>Dutch (nld_Latn): 2 cleaners, 3 filters</summary>

- Strip incomplete sentence endings and text cutoffs
- Strip boilerplate/navigation artifacts (footers, sidebars)
- Drop machine translation artifacts and nonsensical syntax
- Drop structural noise (UI artifacts, footer leakage)
- Drop aggressive filter patterns

</details>

<details>
<summary>Swedish (swe_Latn): 1 cleaner, 3 filters</summary>

- Strip truncated document endings
- Drop adult content and SEO spam
- Drop truncated/incomplete paragraphs (scraping errors, paywalls)
- Drop adult content keyword stuffing (density filter)

</details>

<details>
<summary>Norwegian Bokmål (nob_Latn): 1 cleaner, 2 filters</summary>

- Strip boilerplate/footer artifacts and platform noise
- Drop boilerplate interface elements
- Drop SEO spam keyword stuffing

</details>

<details>
<summary>Danish (dan_Latn): 5 rules</summary>

- Drop short documents (< 500 chars / 50 words)
- Drop documents with high adult/spam keyword density
- Remove lines containing adult/NSFW keywords
- Strip truncation artifacts ("Læs mere")
- Drop SEO spam pages

</details>

## Schema

Same as fineweb-2: all original columns are preserved. The `text` column contains the cleaned text.

## Method

```
Sample docs → LLM identifies problems → Propose fix → Train 5 min → BPB improved? → Keep / Revert
```

See [autocurate](https://github.com/bwang-pplx/autocurate) for details.

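The keep/revert loop can be sketched as a greedy search over candidate fixes. This is a toy illustration, not autocurate's code: `greedy_curate`, `drop_spam`, `drop_short`, `toy_bpb`, and the `SPAM_WORDS` list are all hypothetical, and `toy_bpb` is a cheap stand-in for the real step of training a small model for 5 minutes and scoring it on Wikipedia.

```python
from typing import Callable, Iterable, List

Doc = str
Fix = Callable[[List[Doc]], List[Doc]]

# Hypothetical keyword list, for illustration only.
SPAM_WORDS = {"casino", "jackpot"}


def greedy_curate(
    docs: List[Doc],
    candidate_fixes: Iterable[Fix],
    evaluate_bpb: Callable[[List[Doc]], float],
) -> List[Doc]:
    """Try each fix in turn and keep it only if the score (BPB) goes down."""
    best_docs = docs
    best_bpb = evaluate_bpb(best_docs)
    for fix in candidate_fixes:
        trial_docs = fix(best_docs)
        trial_bpb = evaluate_bpb(trial_docs)
        if trial_bpb < best_bpb:  # lower BPB is better: keep the fix
            best_docs, best_bpb = trial_docs, trial_bpb
        # otherwise the fix is reverted by not adopting trial_docs
    return best_docs


def drop_spam(docs: List[Doc], max_density: float = 0.2) -> List[Doc]:
    """Keyword density filter: drop empty docs and docs with too many spam words."""
    kept = []
    for doc in docs:
        words = [w.strip(".,!?") for w in doc.lower().split()]
        if words and sum(w in SPAM_WORDS for w in words) / len(words) <= max_density:
            kept.append(doc)
    return kept


def drop_short(docs: List[Doc]) -> List[Doc]:
    """Deliberately over-aggressive fix: keeps only docs with fewer than 3 words."""
    return [d for d in docs if len(d.split()) < 3]


def toy_bpb(docs: List[Doc]) -> float:
    """Stand-in score (NOT real training): 1.0 plus the corpus-wide spam-word rate."""
    words = [w.strip(".,!?") for d in docs for w in d.lower().split()]
    if not words:
        return float("inf")  # an empty corpus can never count as an improvement
    return 1.0 + sum(w in SPAM_WORDS for w in words) / len(words)


docs = [
    "The museum opens at nine.",
    "casino jackpot casino bonus",
    "Trains run every hour.",
]
curated = greedy_curate(docs, [drop_spam, drop_short], toy_bpb)
print(curated)
```

In this toy run `drop_spam` lowers the score and is kept, while `drop_short` empties the corpus, scores infinity, and is reverted. This mirrors the rule that only fixes which improve BPB survive.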
## License

Same as fineweb-2: [ODC-BY](https://opendatacommons.org/licenses/by/1-0/).

Provider:
bowang0911