ray0rf1re/Fineweb-Tiny
Source: Hugging Face. Updated 2026-04-04. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ray0rf1re/Fineweb-Tiny
Description:
---
license: odc-by
language:
- en
size_categories:
- 100B<n<1T
---
# Fineweb-Tiny
## Dataset Description
**Fineweb-Tiny** is a quality-filtered subset extracted from [`nampdn-ai/mini-fineweb`](https://huggingface.co/datasets/nampdn-ai/mini-fineweb).
### How "The Best" Was Determined
This dataset was created programmatically by streaming the original dataset and sorting chunks by a heuristic quality score. The heuristic favors:
1. High `language_score` (if provided by the upstream extraction).
2. Optimal document length (penalizing abnormally short snippets and excessively long, unformatted dumps).
3. Strong structural coherence suitable for pre-training Small Language Models (SLMs) and LLMs.
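The exact scoring code is not published in this card, so the following is only a minimal sketch of how the three criteria above could be combined into one number. The function names, thresholds (`MIN_CHARS`, `MAX_CHARS`), weights, the `text` field name, and the punctuation-based coherence proxy are all assumptions made for illustration; only `language_score` comes from the description above.

```python
from typing import Any, Mapping

# Illustrative assumptions, not the values used to build Fineweb-Tiny.
MIN_CHARS = 200      # below this, a document counts as "abnormally short"
MAX_CHARS = 20_000   # above this, a document counts as "excessively long"


def length_score(n_chars: int) -> float:
    """Return 1.0 inside [MIN_CHARS, MAX_CHARS] and decay toward 0 outside it."""
    if n_chars <= 0:
        return 0.0
    if n_chars < MIN_CHARS:
        return n_chars / MIN_CHARS
    if n_chars > MAX_CHARS:
        return MAX_CHARS / n_chars
    return 1.0


def structure_score(text: str) -> float:
    """Fraction of non-empty lines ending in sentence punctuation (a crude proxy)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    ended = sum(ln.endswith((".", "!", "?", ":", '"')) for ln in lines)
    return ended / len(lines)


def quality_score(doc: Mapping[str, Any]) -> float:
    """Weighted sum of the three criteria; weights are hypothetical."""
    text = doc.get("text") or ""
    lang = doc.get("language_score")
    # Fall back to a neutral value when upstream provides no language_score.
    lang = 0.5 if lang is None else float(lang)
    return 0.5 * lang + 0.3 * length_score(len(text)) + 0.2 * structure_score(text)
```

With these assumed weights, a well-formed 400-character document with a high `language_score` ranks well above a two-character fragment with the same `language_score`, which matches the intent of criteria 1 to 3.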
**Fineweb-Tiny** contains the top-scoring **72.9 GB** (compressed Parquet) of the original source. The lower-scoring 50% of the stream is dropped so that the remaining text is dense and high-utility.
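One way to drop the lower-scoring half of a stream without loading it all into memory is to rank documents within fixed-size chunks and keep the top half of each chunk. The sketch below assumes that interpretation; `keep_top_half`, `chunks`, and the `chunk_size` default are hypothetical names and values, and the `score` callable stands in for whatever quality heuristic is actually used.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List, Mapping

Doc = Mapping[str, Any]


def chunks(docs: Iterable[Doc], size: int) -> Iterator[List[Doc]]:
    """Yield consecutive lists of up to `size` documents from a stream."""
    it = iter(docs)
    while chunk := list(islice(it, size)):
        yield chunk


def keep_top_half(
    docs: Iterable[Doc],
    score: Callable[[Doc], float],
    chunk_size: int = 10_000,  # assumed value; trades memory for ranking quality
) -> Iterator[Doc]:
    """Rank each chunk by `score` and yield its higher-scoring half."""
    for chunk in chunks(docs, chunk_size):
        ranked = sorted(chunk, key=score, reverse=True)
        # Drop the bottom floor(n / 2) documents of the chunk.
        yield from ranked[: len(chunk) - len(chunk) // 2]
```

For example, `list(keep_top_half(stream, score=lambda d: d["language_score"]))` would keep roughly half of `stream`. Because ranking is local to each chunk, the result only approximates a global top 50%; a larger `chunk_size` brings it closer at the cost of memory.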
### License
This dataset is released under the **Open Data Commons Attribution License (ODC-By) v1.0** to match the licensing of the upstream mini-fineweb dataset.
Provided by:
ray0rf1re



