bowang0911/fineweb-2-autocurate
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/bowang0911/fineweb-2-autocurate
Description:
---
license: odc-by
task_categories:
- text-generation
language:
- da
- sv
- "no"
- fi
- nl
- pl
- tr
- vi
- cs
tags:
- fineweb
- autocurate
- data-quality
- multilingual
pretty_name: fineweb-2-autocurate
configs:
- config_name: dan_Latn
  data_files:
  - split: train
    path: data/dan_Latn/train/*
- config_name: swe_Latn
  data_files:
  - split: train
    path: swe_Latn/train/*
- config_name: nob_Latn
  data_files:
  - split: train
    path: nob_Latn/train/*
- config_name: fin_Latn
  data_files:
  - split: train
    path: fin_Latn/train/*
- config_name: nld_Latn
  data_files:
  - split: train
    path: nld_Latn/train/*
- config_name: pol_Latn
  data_files:
  - split: train
    path: pol_Latn/train/*
- config_name: ces_Latn
  data_files:
  - split: train
    path: ces_Latn/train/*
- config_name: tur_Latn
  data_files:
  - split: train
    path: tur_Latn/train/*
- config_name: vie_Latn
  data_files:
  - split: train
    path: vie_Latn/train/*
---
# fineweb-2-autocurate
Autonomously curated subsets of [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2).
An LLM agent iteratively samples documents, identifies quality problems, and proposes heuristic fixes. Each fix is validated by training a small language model for 5 minutes and measuring bits per byte (BPB; lower is better) on a Wikipedia eval set. Only fixes that lower BPB are kept.
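
For readers unfamiliar with the metric, BPB divides a model's total cross-entropy (in bits) by the UTF-8 byte length of the evaluated text, which makes scores comparable across tokenizers. The following is a minimal sketch, assuming the loss is available as a negative log-likelihood summed in nats; the function name and sample numbers are illustrative and not taken from autocurate.

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must not be empty")
    # nats -> bits is a division by ln(2); then normalize by byte count.
    return total_nll_nats / (math.log(2) * n_bytes)

# Illustrative numbers only: a 10-byte string scored at 6.93 nats in total.
print(round(bits_per_byte(6.93, "hej verden"), 3))
```

Normalizing by bytes rather than tokens matters here because non-ASCII letters such as `æ` or `ř` occupy more than one byte.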
Built with [autocurate](https://github.com/bwang-pplx/autocurate).
## Subsets
| Language | Subset | Original Docs | Kept Docs | Kept % | BPB Before → After | Improvement |
|---|---|---|---|---|---|---|
| Danish | `dan_Latn` | 45.4M | 39.5M | 87% | 1.448 → 1.422 | -1.8% |
| Swedish | `swe_Latn` | 59.5M | 53.9M | 91% | 1.443 → 1.437 | -0.4% |
| Norwegian Bokmål | `nob_Latn` | 38.1M | 30.6M | 80% | 1.447 → 1.439 | -0.5% |
| Finnish | `fin_Latn` | 36.7M | 33.0M | 90% | 1.386 → 1.374 | -0.8% |
| Dutch | `nld_Latn` | 147.3M | 126.0M | 86% | 1.405 → 1.391 | -1.0% |
| Polish | `pol_Latn` | 152.0M | 100.3M | 66% | 1.196 → 1.176 | -1.7% |
| Czech | `ces_Latn` | 66.1M | 32.5M | 49% | 1.167 → 1.153 | -1.1% |
| Turkish | `tur_Latn` | 95.1M | 83.9M | 88% | 1.117 → 1.111 | -0.6% |
| Vietnamese | `vie_Latn` | 61.1M | 61.0M | 99.9% | 0.863 → 0.808 | -6.4% |
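
The Improvement column appears to be the relative change in BPB. A small sketch under that assumption, using values copied from the table (recomputing from the rounded BPB figures can differ from the published column in the last digit for some rows):

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after` in percent (negative means lower BPB)."""
    return (after - before) / before * 100.0

# BPB values copied from the table above.
rows = [("dan_Latn", 1.448, 1.422), ("pol_Latn", 1.196, 1.176), ("vie_Latn", 0.863, 0.808)]
for name, before, after in rows:
    print(name, f"{relative_change_pct(before, after):.1f}%")
```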
## Usage
```python
from datasets import load_dataset
ds = load_dataset("bowang0911/fineweb-2-autocurate", "dan_Latn", split="train")
```
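
The larger subsets hold tens of millions of documents, so streaming may be preferable to a full download. A hedged sketch using the `streaming=True` option of `load_dataset`; the helper name `first_texts` and the `SUBSETS` list are illustrative conveniences, not part of the dataset:

```python
# Config names as listed in the Subsets table.
SUBSETS = ["dan_Latn", "swe_Latn", "nob_Latn", "fin_Latn", "nld_Latn",
           "pol_Latn", "ces_Latn", "tur_Latn", "vie_Latn"]

def first_texts(subset: str, limit: int = 3) -> list:
    """Return the `text` field of the first `limit` documents of one subset."""
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    # Imported lazily so this module loads even without the `datasets` package.
    from datasets import load_dataset
    ds = load_dataset("bowang0911/fineweb-2-autocurate", subset,
                      split="train", streaming=True)
    return [row["text"] for row in ds.take(limit)]
```

Calling `first_texts("dan_Latn")` requires network access and fetches only the first few records.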
## Filters applied per language
<details>
<summary>Vietnamese (vie_Latn) — 1 filter</summary>
- **Drop gambling/casino spam** — keyword density filter for betting and casino spam injected into articles
</details>
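
A keyword density filter of the kind named for Vietnamese can be sketched as follows. The keyword list and the 5% threshold are hypothetical; the card does not state the actual values.

```python
import re

# Hypothetical keyword list; the real list used for vie_Latn is not given in this card.
SPAM_KEYWORDS = {"casino", "baccarat", "jackpot"}

def keyword_density(text: str, keywords: set = SPAM_KEYWORDS) -> float:
    """Fraction of word tokens that appear in the keyword set."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in keywords for tok in tokens) / len(tokens)

def keep_document(text: str, max_density: float = 0.05) -> bool:
    """Keep the document unless keyword density exceeds the assumed threshold."""
    return keyword_density(text) <= max_density
```

A density threshold, rather than a single keyword match, avoids dropping ordinary articles that merely mention gambling once.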
<details>
<summary>Turkish (tur_Latn) — 2 cleaners, 3 filters</summary>
- Strip aggregated unrelated content (news tickers, sidebars)
- Strip truncated line endings
- Drop incomplete/truncated text ending mid-sentence (2 rules)
- Drop SEO spam keyword stuffing
</details>
<details>
<summary>Czech (ces_Latn) — 2 cleaners, 4 filters</summary>
- Strip truncated line endings (`...splňuje`, `...Pokračovat ve čtení`)
- Strip boilerplate footers (`Můžete také zajímat`, `Zaregistrovat se`, `více »`)
- Drop truncated articles containing `...`
- Drop e-commerce pages (`Skladem`, `PPL`, `Koupit`, `Poštovné`)
- Drop cookie/boilerplate pages (`využíváme soubory cookies`, `Copyright`)
- Drop incomplete text (`číst dále`, `zobrazit více`)
</details>
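
The lists distinguish cleaners ("Strip", which edit a document's text) from filters ("Drop", which remove the whole document). A minimal sketch of that distinction using marker strings quoted from the Czech list; the matching logic, the `min_hits` parameter, and the function names are assumptions, not the rules autocurate generated:

```python
from typing import Optional

# Marker strings are quoted from the Czech list above; everything else is assumed.
FOOTER_MARKERS = ("Můžete také zajímat", "Zaregistrovat se", "více »")
ECOMMERCE_MARKERS = ("Skladem", "PPL", "Koupit", "Poštovné")

def strip_footer_lines(text: str) -> str:
    """Cleaner: remove lines that contain a boilerplate footer marker."""
    kept = [ln for ln in text.splitlines()
            if not any(marker in ln for marker in FOOTER_MARKERS)]
    return "\n".join(kept)

def is_ecommerce_page(text: str, min_hits: int = 2) -> bool:
    """Filter: flag a document when at least `min_hits` distinct shop markers occur."""
    return sum(marker in text for marker in ECOMMERCE_MARKERS) >= min_hits

def process(text: str) -> Optional[str]:
    """Apply the filter first, then the cleaner; None means the document is dropped."""
    if is_ecommerce_page(text):
        return None
    return strip_footer_lines(text)
```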
<details>
<summary>Polish (pol_Latn) — 2 cleaners, 3 filters</summary>
- Strip footer boilerplate (`Ostatnie wiadomości`, `Newsletter`, `Zaloguj się`)
- Strip cookie/UI noise (`Ta strona używa plików cookies`)
- Drop truncated text (`...`, `czytaj dalej`, `zobacz więcej`)
- Drop truncated text with abrupt cutoffs
- Drop cookie banner documents
</details>
<details>
<summary>Finnish (fin_Latn) — 1 cleaner, 5 filters</summary>
- Strip link farm / footer leakage
- Drop CLI flags and ellipsis patterns
- Drop machine translation errors and gibberish
- Drop OCR errors and garbled text
- Drop grammar/syntax errors and morphological breakdowns
- Drop SEO spam keyword stuffing
</details>
<details>
<summary>Dutch (nld_Latn) — 2 cleaners, 3 filters</summary>
- Strip incomplete sentence endings and text cutoffs
- Strip boilerplate/navigation artifacts (footers, sidebars)
- Drop machine translation artifacts and nonsensical syntax
- Drop structural noise (UI artifacts, footer leakage)
- Drop aggressive filter patterns
</details>
<details>
<summary>Swedish (swe_Latn) — 1 cleaner, 3 filters</summary>
- Strip truncated document endings
- Drop adult content and SEO spam
- Drop truncated/incomplete paragraphs (scraping errors, paywalls)
- Drop adult content keyword stuffing (density filter)
</details>
<details>
<summary>Norwegian Bokmål (nob_Latn) — 1 cleaner, 2 filters</summary>
- Strip boilerplate/footer artifacts and platform noise
- Drop boilerplate interface elements
- Drop SEO spam keyword stuffing
</details>
<details>
<summary>Danish (dan_Latn) — 5 rules</summary>
- Drop short documents (< 500 chars / 50 words)
- Drop documents with high adult/spam keyword density
- Remove lines containing adult/NSFW keywords
- Strip truncation artifacts ("Læs mere")
- Drop SEO spam pages
</details>
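
Two of the Danish rules are concrete enough to sketch. The card does not say whether the length rule requires both thresholds or either one; the version below assumes either, and the regular expression for "Læs mere" is likewise an illustrative guess.

```python
import re
from typing import Optional

def is_too_short(text: str, min_chars: int = 500, min_words: int = 50) -> bool:
    """Assumed reading of the rule: short if under either threshold."""
    return len(text) < min_chars or len(text.split()) < min_words

def strip_read_more(text: str) -> str:
    """Remove a trailing "Læs mere" teaser, with an optional ellipsis, from each line."""
    pattern = r"[ \t]*(?:\.{3}|…)?[ \t]*Læs mere[ \t.…]*$"
    return re.sub(pattern, "", text, flags=re.MULTILINE)

def clean_danish(text: str) -> Optional[str]:
    """Clean first, then apply the length filter; None means the document is dropped."""
    cleaned = strip_read_more(text)
    return None if is_too_short(cleaned) else cleaned
```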
## Schema
The schema matches fineweb-2: all original columns are preserved, and the `text` column contains the cleaned text.
## Method
```
Sample docs → LLM identifies problems → Propose fix → Train 5 min → BPB improved? → Keep / Revert
```
See [autocurate](https://github.com/bwang-pplx/autocurate) for details.
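
The keep/revert step can be read as a greedy acceptance loop. This is a simplified sketch, not autocurate's code: `evaluate_bpb` is a hypothetical callback standing in for "train for 5 minutes and measure BPB on the eval set", and `Rule` is an assumed type for a proposed fix.

```python
from typing import Callable, List

Rule = Callable[[str], str]  # hypothetical: a rule maps document text to cleaned text

def curate(candidates: List[Rule],
           evaluate_bpb: Callable[[List[Rule]], float]) -> List[Rule]:
    """Greedy accept/revert loop: keep a candidate rule only if it lowers BPB."""
    accepted: List[Rule] = []
    best = evaluate_bpb(accepted)  # baseline score with no rules applied
    for rule in candidates:
        score = evaluate_bpb(accepted + [rule])
        if score < best:  # lower BPB is better
            accepted.append(rule)
            best = score
        # otherwise the candidate is discarded, i.e. reverted
    return accepted
```

Because each candidate is scored on top of the rules already accepted, the order in which fixes are proposed can affect which ones survive.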
## License
Same as fineweb-2: [ODC-BY](https://opendatacommons.org/licenses/by/1-0/).
Provider:
bowang0911



