mdonigian/fineweb-edu-curated
Hugging Face · Updated 2026-02-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/mdonigian/fineweb-edu-curated
Resource description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
size_categories:
- 1M<n<10M
tags:
- curated
- fineweb
- education
- stem
---
# FineWeb-Edu Curated
A curated subset of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
optimised for training language models with strong STEM and structured-reasoning capabilities.
## Dataset Summary
- **Total documents:** 3,656,053
- **Total tokens:** 4.3B
- **Source:** FineWeb-Edu sample-100BT
- **Curation method:** Multi-label topic classification + complexity scoring + distribution-aware sampling + MinHash deduplication
## Topic Distribution
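Documents are classified into 11 target groups using a 17-label multi-label classifier
(ModernBERT-base, sigmoid threshold=0.3). STEM-core topics (Mathematics,
Computer Science, ML/AI) are boosted relative to their natural distribution.
Because a document can carry several labels, it counts toward every group it is assigned to.
The Actual % column (each group's tokens as a share of all tokens) therefore sums to more than 100%.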
| Group | Target % | Actual % | Tokens | Docs |
|-------|----------|----------|--------|------|
| Mathematics | 7.0% | 14.7% | 632,372,399 | 642,275 |
| Computer Science | 8.0% | 16.9% | 730,556,604 | 704,479 |
| ML/AI | 5.0% | 4.4% | 188,754,622 | 168,769 |
| Physical Sciences | 4.0% | 14.2% | 612,268,635 | 525,241 |
| Life Sciences | 3.0% | 21.9% | 947,146,778 | 810,167 |
| Engineering/Tech | 5.0% | 24.3% | 1,047,041,226 | 925,919 |
| Environmental Sci | 2.0% | 20.4% | 878,638,307 | 782,712 |
| Medicine/Health | 4.0% | 21.4% | 921,433,959 | 804,158 |
| Business/Economics | 4.0% | 24.3% | 1,048,966,632 | 837,069 |
| Law/Government | 3.0% | 38.0% | 1,640,637,230 | 1,232,236 |
| General Knowledge | 55.0% | 92.7% | 4,001,290,512 | 3,338,895 |
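
The per-group accounting can be sketched as follows. This is a minimal illustration that assumes rows shaped like the dataset schema (`assigned_groups`, `token_count`). The function name `group_token_share` and the toy rows are hypothetical, not the author's code.

```python
from collections import Counter


def group_token_share(rows):
    """Return {group: percent of all tokens found in documents carrying that group}."""
    total_tokens = 0
    tokens_per_group = Counter()
    for row in rows:
        total_tokens += row["token_count"]
        # A multi-label document contributes its full token count to every group,
        # which is why the shares can add up to more than 100%.
        for group in set(row["assigned_groups"]):
            tokens_per_group[group] += row["token_count"]
    return {g: 100.0 * t / total_tokens for g, t in tokens_per_group.items()}


# Toy rows with made-up values (not taken from the dataset).
rows = [
    {"token_count": 600, "assigned_groups": ["Mathematics", "General Knowledge"]},
    {"token_count": 400, "assigned_groups": ["General Knowledge"]},
]
shares = group_token_share(rows)
print(shares)
```

In this toy case the two shares add up to 160%, the same effect that pushes the table's Actual % column past 100%.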
## Complexity Distribution
Reasoning complexity is scored by a ModernBERT-base regression model on a 1.0-4.0 scale.
Mean complexity is 2.77; the median is 2.91.
| Level | Range | Target % | Actual % | Docs |
|-------|-------|----------|----------|------|
| L1 | [1.0, 1.75) | 10.0% | 14.3% | 523,809 |
| L2 | [1.75, 2.5) | 20.0% | 22.1% | 809,032 |
| L3 | [2.5, 3.25) | 40.0% | 39.6% | 1,446,264 |
| L4 | [3.25, 4.0) | 30.0% | 24.0% | 876,948 |
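
Mapping a `complexity` score to its level is a lookup against the bin edges in the table. A minimal sketch follows. The helper name `complexity_level` is hypothetical, and placing a score of exactly 4.0 in L4 is an assumption, since the table writes the last range as half-open.

```python
from bisect import bisect_right

# Lower bounds of L2, L3 and L4, taken from the table above.
BOUNDARIES = [1.75, 2.5, 3.25]
LEVELS = ["L1", "L2", "L3", "L4"]


def complexity_level(score: float) -> str:
    """Map a complexity score on the 1.0-4.0 scale to its level label."""
    if not 1.0 <= score <= 4.0:
        raise ValueError(f"score {score} is outside the 1.0-4.0 scale")
    # bisect_right sends a score equal to a boundary into the higher bin,
    # matching the half-open ranges [low, high).
    # Assumption: a score of exactly 4.0 is treated as L4.
    return LEVELS[bisect_right(BOUNDARIES, score)]


# The reported mean (2.77) and median (2.91) both fall in L3.
print(complexity_level(2.77), complexity_level(2.91))
```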
## Multi-Label Statistics
- Mean labels per document: 4.0
- Documents with 2+ labels: 99.5%
- Documents with 3+ labels: 90.7%
## Token Count Distribution
| Statistic | Tokens |
|------------|--------|
| P10 | 265 |
| P25 | 439 |
| P50 (median) | 738 |
| P75 | 1,260 |
| P90 | 2,221 |
| Mean | 1,180 |
## Top Domains
| Domain | Tokens | % |
|--------|--------|---|
| en.wikipedia.org | 65,706,303 | 1.52% |
| en.m.wikipedia.org | 10,121,881 | 0.23% |
| slideplayer.com | 8,915,082 | 0.21% |
| www.encyclopedia.com | 8,112,424 | 0.19% |
| journals.plos.org | 6,990,412 | 0.16% |
| en.wikisource.org | 6,889,207 | 0.16% |
| www.nap.edu | 6,673,714 | 0.15% |
| www.newworldencyclopedia.org | 6,555,558 | 0.15% |
| www.reference.com | 6,136,223 | 0.14% |
| link.springer.com | 5,912,585 | 0.14% |
## Schema
Each row contains:
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Document text |
| `url` | string | Source URL |
| `token_count` | int | Token count |
| `dump` | string | Common Crawl dump identifier |
| `topic_scores` | list[float] | 17-dim sigmoid scores from topic classifier |
| `complexity` | float | Reasoning complexity score (1.0-4.0) |
| `assigned_groups` | list[string] | Target groups at threshold=0.3 |
| `relevance_score` | float | Composite relevance score used for sampling |
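
The schema fields make it straightforward to carve out a subset. The sketch below filters by group and minimum complexity using the Hugging Face `datasets` library in streaming mode. Three things here are assumptions not confirmed by the card: the split name `"train"`, the idea that `assigned_groups` stores group names spelled as in the Topic Distribution table, and the helper names `keep_row` and `stream_subset`.

```python
def keep_row(row, group="Mathematics", min_complexity=3.25):
    """True if the row is assigned to `group` and meets the complexity floor."""
    # Assumption: `assigned_groups` holds group names as spelled in the card's table.
    return group in row["assigned_groups"] and row["complexity"] >= min_complexity


def stream_subset(repo_id="mdonigian/fineweb-edu-curated", split="train", **filters):
    """Lazily yield matching rows without downloading the whole dataset."""
    # Imported here so that keep_row can be used without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the data is exposed under a split named "train".
    ds = load_dataset(repo_id, split=split, streaming=True)
    return ds.filter(lambda row: keep_row(row, **filters))


# Local check on a hand-written row (made-up values).
example = {"assigned_groups": ["Mathematics", "General Knowledge"], "complexity": 3.4}
print(keep_row(example))
```

Calling `stream_subset(group="ML/AI", min_complexity=2.5)` and iterating over the result would then yield only the matching documents, provided the assumptions above hold.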
## Methodology
1. **Classification:** Both classifiers (topic: 17-label multi-label; complexity: regression)
run on each document. The full sigmoid vectors are stored, so groups can be re-derived at a threshold other than 0.3.
2. **Filtering:** Priority-based sampling fills STEM-core topics first and applies the complexity
distribution targets within each group. Multi-label documents count toward multiple quotas.
3. **Deduplication:** MinHash LSH (128 permutations, 13-gram word shingles, 0.7 Jaccard threshold).
The highest-relevance document is kept from each cluster.
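
The deduplication step can be illustrated with a small standard-library sketch. It uses the card's parameters (128 hash functions, 13-word shingles, 0.7 threshold) but is not the author's pipeline. It replaces LSH banding with an all-pairs comparison, which is only practical for small inputs, and it approximates "keep the highest-relevance document per cluster" with a greedy pass in descending `relevance_score` order. All function names are hypothetical.

```python
import hashlib

NUM_PERM = 128     # number of hash functions, as stated in the card
SHINGLE_SIZE = 13  # words per shingle, as stated in the card
THRESHOLD = 0.7    # Jaccard threshold, as stated in the card


def shingles(text, k=SHINGLE_SIZE):
    """Set of k-word shingles; texts shorter than k words form a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=NUM_PERM):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "little")
            for s in encoded
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs):
    """Greedy all-pairs dedup (no LSH): higher relevance_score wins within a near-duplicate group."""
    kept = []
    for doc in sorted(docs, key=lambda d: d["relevance_score"], reverse=True):
        sig = minhash_signature(doc["text"])
        if all(estimated_jaccard(sig, other) < THRESHOLD for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]


# Toy documents (made-up): two identical texts and one unrelated text.
base = " ".join(f"w{i}" for i in range(40))
docs = [
    {"text": base, "relevance_score": 0.9},
    {"text": base, "relevance_score": 0.4},
    {"text": " ".join(f"x{i}" for i in range(40)), "relevance_score": 0.5},
]
kept = deduplicate(docs)
print([d["relevance_score"] for d in kept])
```

At the scale of this dataset, an LSH index would be needed to avoid comparing every pair, which is presumably why the card names MinHash LSH rather than plain MinHash.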
Provider:
mdonigian



