
mdonigian/fineweb-edu-curated

Source: Hugging Face · Updated: 2026-02-20 · Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/mdonigian/fineweb-edu-curated
Description:
---
license: odc-by
task_categories:
- text-generation
language:
- en
size_categories:
- 10B<n<100B
tags:
- curated
- fineweb
- education
- stem
---

# FineWeb-Edu Curated

A curated subset of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) optimised for training language models with strong STEM and structured-reasoning capabilities.

## Dataset Summary

- **Total documents:** 3,656,053
- **Total tokens:** 4.3B
- **Source:** FineWeb-Edu sample-100BT
- **Curation method:** Multi-label topic classification + complexity scoring + distribution-aware sampling + MinHash deduplication

## Topic Distribution

Documents are classified into 11 target groups using a 17-label multi-label classifier (ModernBERT-base, sigmoid threshold=0.3). STEM-core topics (Mathematics, Computer Science, ML/AI) are boosted relative to their natural distribution.

| Group | Target % | Actual % | Tokens | Docs |
|-------|----------|----------|--------|------|
| Mathematics | 7.0% | 14.7% | 632,372,399 | 642,275 |
| Computer Science | 8.0% | 16.9% | 730,556,604 | 704,479 |
| ML/AI | 5.0% | 4.4% | 188,754,622 | 168,769 |
| Physical Sciences | 4.0% | 14.2% | 612,268,635 | 525,241 |
| Life Sciences | 3.0% | 21.9% | 947,146,778 | 810,167 |
| Engineering/Tech | 5.0% | 24.3% | 1,047,041,226 | 925,919 |
| Environmental Sci | 2.0% | 20.4% | 878,638,307 | 782,712 |
| Medicine/Health | 4.0% | 21.4% | 921,433,959 | 804,158 |
| Business/Economics | 4.0% | 24.3% | 1,048,966,632 | 837,069 |
| Law/Government | 3.0% | 38.0% | 1,640,637,230 | 1,232,236 |
| General Knowledge | 55.0% | 92.7% | 4,001,290,512 | 3,338,895 |

## Complexity Distribution

Reasoning complexity scored by a ModernBERT-base regression model (1.0-4.0 scale). Mean complexity: 2.77, Median: 2.91.

| Level | Range | Target % | Actual % | Docs |
|-------|-------|----------|----------|------|
| L1 | [1.0, 1.75) | 10.0% | 14.3% | 523,809 |
| L2 | [1.75, 2.5) | 20.0% | 22.1% | 809,032 |
| L3 | [2.5, 3.25) | 40.0% | 39.6% | 1,446,264 |
| L4 | [3.25, 4.0) | 30.0% | 24.0% | 876,948 |

## Multi-Label Statistics

- Mean labels per document: 4.0
- Documents with 2+ labels: 99.5%
- Documents with 3+ labels: 90.7%

## Token Count Distribution

| Percentile | Tokens |
|------------|--------|
| P10 | 265 |
| P25 | 439 |
| P50 (median) | 738 |
| P75 | 1,260 |
| P90 | 2,221 |
| Mean | 1,180 |

## Top Domains

| Domain | Tokens | % of tokens |
|--------|--------|-------------|
| en.wikipedia.org | 65,706,303 | 1.52% |
| en.m.wikipedia.org | 10,121,881 | 0.23% |
| slideplayer.com | 8,915,082 | 0.21% |
| www.encyclopedia.com | 8,112,424 | 0.19% |
| journals.plos.org | 6,990,412 | 0.16% |
| en.wikisource.org | 6,889,207 | 0.16% |
| www.nap.edu | 6,673,714 | 0.15% |
| www.newworldencyclopedia.org | 6,555,558 | 0.15% |
| www.reference.com | 6,136,223 | 0.14% |
| link.springer.com | 5,912,585 | 0.14% |

## Schema

Each row contains:

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Document text |
| `url` | string | Source URL |
| `token_count` | int | Token count |
| `dump` | string | Common Crawl dump identifier |
| `topic_scores` | list[float] | 17-dim sigmoid scores from topic classifier |
| `complexity` | float | Reasoning complexity score (1.0-4.0) |
| `assigned_groups` | list[string] | Target groups at threshold=0.3 |
| `relevance_score` | float | Composite relevance score used for sampling |

## Methodology

1. **Classification:** Both classifiers (topic: 17-label multi-label, complexity: regression) run on each document. Full sigmoid vectors stored for maximum flexibility.
2. **Filtering:** Priority-based sampling targeting STEM-core topics first, with complexity distribution targets within each group. Multi-label documents count toward multiple quotas.
3. **Deduplication:** MinHash LSH (128 perms, 13-gram word shingles, 0.7 Jaccard threshold). Highest-relevance document kept from each cluster.
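
The fields listed under Schema are enough to carve out a narrower sub-corpus. The following is a minimal sketch, assuming rows arrive as plain dictionaries with the schema fields and that `assigned_groups` holds group names spelled as in the Topic Distribution table (the exact label strings are an assumption; inspect a few rows first). `select_rows` is a hypothetical helper, not part of the dataset's tooling.

```python
from typing import Dict, Iterable, Iterator, Set


def select_rows(
    rows: Iterable[Dict],
    groups: Set[str],
    min_complexity: float = 3.25,
) -> Iterator[Dict]:
    """Yield rows tagged with any wanted group at or above a complexity floor."""
    for row in rows:
        in_group = bool(groups & set(row["assigned_groups"]))
        if in_group and row["complexity"] >= min_complexity:
            yield row


# Toy rows for illustration only; these are not real records from the dataset.
sample_rows = [
    {"url": "https://example.org/a", "complexity": 3.6,
     "assigned_groups": ["Mathematics", "General Knowledge"]},
    {"url": "https://example.org/b", "complexity": 2.1,
     "assigned_groups": ["Mathematics"]},
    {"url": "https://example.org/c", "complexity": 3.9,
     "assigned_groups": ["Law/Government"]},
]

selected_urls = [r["url"] for r in select_rows(sample_rows, {"Mathematics"})]
print(selected_urls)
```

Because the helper is a generator, it should also work over a streamed copy of the dataset (for example, rows from the `datasets` library's `load_dataset(..., streaming=True)`), though the split name to request is not stated in the card.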
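
Because the full 17-dimensional sigmoid vector is stored in `topic_scores`, labels can be re-derived at a threshold other than 0.3. A small sketch under two assumptions: a label counts as active when its score is greater than or equal to the threshold (the card does not say whether the comparison is strict), and labels are referred to by index because the card names neither the 17 labels nor their mapping onto the 11 groups. `labels_above` is a hypothetical helper.

```python
from typing import List, Sequence


def labels_above(topic_scores: Sequence[float], threshold: float = 0.3) -> List[int]:
    """Return the indices of topic labels whose sigmoid score meets the threshold."""
    return [i for i, score in enumerate(topic_scores) if score >= threshold]


# Toy 17-dim score vector; values are invented for illustration.
scores = [0.05] * 17
scores[2] = 0.82
scores[9] = 0.41
scores[14] = 0.31

default_labels = labels_above(scores)                # threshold used by the card
strict_labels = labels_above(scores, threshold=0.5)  # a stricter, user-chosen cut
print(default_labels, strict_labels)
```

Raising the threshold trades recall for precision: fewer documents per topic, but each one more confidently on-topic.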
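
The level boundaries in the Complexity Distribution table can be applied directly to the `complexity` field. The sketch below bins scores with `bisect`; it assumes a score of exactly 4.0 belongs to L4, even though the table writes that range as half-open. `complexity_level` is a hypothetical helper name.

```python
from bisect import bisect_right
from collections import Counter

# Lower edges of L2, L3 and L4, taken from the Complexity Distribution table.
LEVEL_EDGES = [1.75, 2.5, 3.25]
LEVEL_NAMES = ["L1", "L2", "L3", "L4"]


def complexity_level(score: float) -> str:
    """Map a 1.0-4.0 complexity score to its level; lower bounds are inclusive."""
    return LEVEL_NAMES[bisect_right(LEVEL_EDGES, score)]


# Toy scores chosen to hit each boundary; not real data.
toy_scores = [1.0, 1.75, 2.49, 2.5, 2.91, 3.25, 4.0]
levels = [complexity_level(s) for s in toy_scores]
level_counts = Counter(levels)
print(levels)
print(dict(level_counts))
```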
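
The Actual % column in the Topic Distribution table sums to roughly 293%, which is consistent with multi-label documents counting toward several groups at once (mean 4.0 labels per document). The figures also appear to be shares of total tokens rather than of documents: 632,372,399 Mathematics tokens out of about 4.3B is about 14.7%. The sketch below reproduces that style of accounting on toy rows; it is an interpretation of the table, not the author's code, and `group_token_share` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_token_share(rows: Iterable[Dict]) -> Dict[str, float]:
    """Percent of all tokens carried by each group, counting a document once per group."""
    rows = list(rows)
    total_tokens = sum(r["token_count"] for r in rows)
    per_group = defaultdict(int)
    for r in rows:
        for group in r["assigned_groups"]:
            per_group[group] += r["token_count"]
    return {g: 100.0 * t / total_tokens for g, t in per_group.items()}


# Toy rows for illustration only.
toy_rows: List[Dict] = [
    {"token_count": 600, "assigned_groups": ["Mathematics", "General Knowledge"]},
    {"token_count": 300, "assigned_groups": ["Computer Science", "General Knowledge"]},
    {"token_count": 100, "assigned_groups": ["Mathematics", "Computer Science"]},
]

shares = group_token_share(toy_rows)
print(shares)
print(sum(shares.values()))  # exceeds 100 because every document carries two labels
```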
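
The deduplication step can be illustrated with a small pure-Python MinHash, using the parameters from the card (128 permutations, 13-word shingles, 0.7 Jaccard threshold). This is a sketch under stated assumptions, not the author's pipeline: it compares signatures pairwise instead of building an LSH index, simulates permutations with salted BLAKE2b hashes, and tokenizes by lowercasing and splitting on whitespace. Visiting documents in descending `relevance_score` order approximates "highest-relevance document kept from each cluster".

```python
import hashlib
from typing import Dict, List, Set

NUM_PERM = 128       # number of hash "permutations", as in the card
SHINGLE_WORDS = 13   # word n-gram size, as in the card
THRESHOLD = 0.7      # estimated-Jaccard cutoff, as in the card


def shingles(text: str, n: int = SHINGLE_WORDS) -> Set[str]:
    """Word n-grams; texts shorter than n words become a single shingle."""
    words = text.lower().split()  # assumed tokenization
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def signature(text: str, num_perm: int = NUM_PERM) -> List[int]:
    """MinHash signature: the minimum salted 64-bit hash per permutation slot."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "little")
            for s in encoded
        ))
    return sig


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of matching slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(docs: List[Dict]) -> List[Dict]:
    """Greedy pass in descending relevance; drops docs too similar to a kept one."""
    kept, kept_sigs = [], []
    for doc in sorted(docs, key=lambda d: d["relevance_score"], reverse=True):
        sig = signature(doc["text"])
        if all(estimated_jaccard(sig, k) < THRESHOLD for k in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


# Toy documents for illustration only.
base = ("the quick brown fox jumps over the lazy dog while the sun sets "
        "behind the distant quiet hills")
other = ("gradient descent updates model weights by following the negative "
         "slope of a loss surface step by step")
toy_docs = [
    {"url": "a", "text": base, "relevance_score": 0.4},
    {"url": "b", "text": base, "relevance_score": 0.9},
    {"url": "c", "text": other, "relevance_score": 0.5},
]

kept_urls = [d["url"] for d in deduplicate(toy_docs)]
print(kept_urls)
```

The pairwise comparison is quadratic in the number of kept documents, so it only suits small samples; at the scale of millions of documents, an LSH index is what makes the same comparison tractable.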
Provider:
mdonigian