ray0rf1re/FineWeb-Nano
Source: Hugging Face. Updated: 2026-04-04. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ray0rf1re/FineWeb-Nano
Resource description:
---
license: odc-by
language:
- en
size_categories:
- 10B<n<100B
---
# FineWeb-Nano
## Dataset Description
**FineWeb-Nano** is a curated, quality-filtered subset extracted from [`nampdn-ai/mini-fineweb`](https://huggingface.co/datasets/nampdn-ai/mini-fineweb).
### How "The Best" Was Determined
This dataset was created programmatically by streaming the original dataset, scoring each chunk with a quality heuristic, and sorting chunks by that score. The heuristic favors:
1. High `language_score` (if provided by the upstream extraction).
2. Optimal document length (penalizing abnormally short snippets and excessively long, unformatted dumps).
3. Strong structural coherence suitable for pre-training Small Language Models (SLMs) and LLMs.
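
The card does not publish the scoring code, so the following is only a minimal sketch of how the three criteria above could be combined into one score. The thresholds (`MIN_CHARS`, `MAX_CHARS`), the weights, the punctuation-based coherence proxy, and all function names are illustrative assumptions, not the values or methods actually used to build FineWeb-Nano.

```python
from typing import Optional

# Hypothetical length band; the dataset card does not state the real thresholds.
MIN_CHARS = 200
MAX_CHARS = 20_000


def length_score(n_chars: int) -> float:
    """Return 1.0 inside the preferred length band, decaying toward 0 outside it."""
    if n_chars <= 0:
        return 0.0
    if n_chars < MIN_CHARS:
        return n_chars / MIN_CHARS  # penalize short snippets
    if n_chars > MAX_CHARS:
        return MAX_CHARS / n_chars  # penalize very long dumps
    return 1.0


def structure_score(text: str) -> float:
    """Crude coherence proxy (an assumption): share of non-empty lines
    that end with sentence-like punctuation."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    ended = sum(1 for ln in lines if ln.endswith((".", "!", "?", ":", '"')))
    return ended / len(lines)


def quality_score(text: str, language_score: Optional[float] = None) -> float:
    """Weighted sum of the three signals; the weights are illustrative."""
    # Use a neutral 0.5 when the upstream record carries no language_score.
    lang = language_score if language_score is not None else 0.5
    return 0.5 * lang + 0.3 * length_score(len(text)) + 0.2 * structure_score(text)
```

Under these assumptions, a well-punctuated document of moderate length scores higher than a short fragment such as `"click here"`, even when both carry the same `language_score`.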
**FineWeb-Nano** keeps only the highest-scoring data: the **best 29.8 gigabytes**, taken from the top percentiles of the `Fineweb-Tiny` subset. It is intended as a compact dataset for rapid model experimentation.
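
As a rough illustration of a size-capped selection like this, the sketch below sorts `(score, text)` pairs by score and keeps documents until a byte budget is reached. The function name, the in-memory sort, the use of UTF-8 byte length as the size measure, and the decimal interpretation of 29.8 gigabytes are all assumptions. The actual pipeline is not published, and a real run over tens of gigabytes would need an out-of-core or sharded sort.

```python
from typing import Iterable, List, Tuple

# Assumes decimal gigabytes (29.8 * 10^9 bytes); the card does not say which unit is meant.
TARGET_BYTES = 29_800_000_000


def select_top_by_budget(
    scored_docs: Iterable[Tuple[float, str]], budget_bytes: int
) -> List[str]:
    """Return the highest-scoring texts, in descending score order,
    stopping at the first document that would exceed budget_bytes."""
    selected: List[str] = []
    used = 0
    # Sort by score only, so ties keep their original (streaming) order.
    for _score, text in sorted(scored_docs, key=lambda pair: pair[0], reverse=True):
        size = len(text.encode("utf-8"))
        if used + size > budget_bytes:
            break  # strict cutoff: everything below this score is dropped
        selected.append(text)
        used += size
    return selected
```

For example, `select_top_by_budget([(0.2, "bbbb"), (0.9, "aaaa"), (0.5, "cccc")], 8)` keeps the two highest-scoring four-byte documents and drops the third.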
### License
This dataset is released under the **Open Data Commons Attribution License (ODC-By) v1.0**, matching the license of the upstream mini-fineweb dataset.
Provider:
ray0rf1re



