
ray0rf1re/Fineweb-Tiny

Source: Hugging Face. Updated: 2026-04-04. Indexed: 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ray0rf1re/Fineweb-Tiny
Description:
---
license: odc-by
language:
- en
size_categories:
- 100B<n<1T
---

# Fineweb-Tiny

## Dataset Description

**Fineweb-Tiny** is a highly curated, premium subset extracted from [`nampdn-ai/mini-fineweb`](https://huggingface.co/datasets/nampdn-ai/mini-fineweb).

### How "The Best" Was Determined

This dataset was created programmatically by streaming the original dataset and sorting chunks based on a rigorous quality scoring algorithm. The heuristic heavily favors:

1. High `language_score` (if provided by the upstream extraction).
2. Optimal document length (penalizing abnormally short snippets and excessively long, unformatted dumps).
3. Strong structural coherence suitable for pre-training Small Language Models (SLMs) and LLMs.

**Fineweb-Tiny** contains the absolute **BEST 72.9 Gigabytes** (compressed Parquet) of the original source. It drops the bottom 50% of lower-quality data from the stream to ensure dense, high-utility text.

### License

This dataset is released under the **Open Data Commons Attribution License (ODC-By) v1.0** to perfectly match the upstream Fineweb mini licensing constraints.
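
The card does not publish the scoring code, so the following is only a minimal sketch of how a heuristic of the kind described under "How "The Best" Was Determined" could look. Everything except the `language_score` field name is an assumption made for illustration: the `text` field, the `MIN_CHARS` / `MAX_CHARS` thresholds, the punctuation-based coherence proxy, the multiplicative combination, and the chunk size are all hypothetical and are not taken from the dataset author.

```python
import math
from itertools import islice
from typing import Iterable, Iterator

# Hypothetical thresholds; the dataset card does not state the real ones.
MIN_CHARS = 200
MAX_CHARS = 20_000


def length_score(n_chars: int) -> float:
    """Return 1.0 inside [MIN_CHARS, MAX_CHARS], decaying toward 0 outside it."""
    if n_chars <= 0:
        return 0.0
    if n_chars < MIN_CHARS:
        return n_chars / MIN_CHARS
    if n_chars > MAX_CHARS:
        return MAX_CHARS / n_chars
    return 1.0


def structure_score(text: str) -> float:
    """Crude coherence proxy (an assumption): the share of non-empty lines
    that end with sentence-final punctuation."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    ended = sum(ln.endswith((".", "!", "?", ":", '"')) for ln in lines)
    return ended / len(lines)


def quality_score(doc: dict) -> float:
    """Combine the three signals by multiplication (an illustrative choice)."""
    text = doc.get("text") or ""  # "text" field name is assumed
    lang = doc.get("language_score")
    if lang is None:
        lang = 0.5  # neutral fallback when upstream provides no score
    return lang * length_score(len(text)) * structure_score(text)


def filter_stream(docs: Iterable[dict], chunk_size: int = 1000) -> Iterator[dict]:
    """Read the stream in chunks, sort each chunk by score, keep its top half."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        chunk.sort(key=quality_score, reverse=True)
        yield from chunk[: math.ceil(len(chunk) / 2)]
```

Sorting within fixed-size chunks keeps memory bounded while streaming, at the cost of only approximating a global "top 50%" cut. A global cut would need either a full pass to estimate the median score or an external sort.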
Provider:
ray0rf1re