fineweb-edu-1B
Source: ModelScope community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/codelion/fineweb-edu-1B
Description:
## Sampling Methodology
This dataset was created using **reservoir sampling**, a single-pass random sampling algorithm that gives every sample in the source dataset an equal probability of being included. As a result, the 1B-token sample is statistically representative of the full dataset's characteristics.
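The equal-probability property can be checked for the classic form of the algorithm (Algorithm R). The dataset card does not state which variant was used, so the symbols here are illustrative assumptions: $k$ is the reservoir size and $n$ is the number of items in the source stream. Item $i > k$ enters the reservoir with probability $k/i$, and at each later step $j$ it is evicted with probability $\frac{k}{j} \cdot \frac{1}{k} = \frac{1}{j}$, so:

$$
P(\text{item } i \text{ is in the final sample}) = \frac{k}{i} \prod_{j=i+1}^{n} \frac{j-1}{j} = \frac{k}{i} \cdot \frac{i}{n} = \frac{k}{n}
$$

For the first $k$ items, which start in the reservoir, the same telescoping product taken from $j = k+1$ to $n$ also gives $k/n$.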
- **Source Dataset**: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Sample Size**: 1B tokens
- **Content**: Curated educational web resources
Because the sample is drawn uniformly at random, it supports rapid experimentation and ablation studies without processing the entire source dataset, while keeping results statistically valid.
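The sampling approach described above can be sketched as follows. This is a minimal illustration of textbook reservoir sampling (Algorithm R), not the code used to build this dataset: the function name `reservoir_sample`, its parameters, and the choice of sampling unit are all assumptions, and the card does not say whether sampling was done per document or against a token budget.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, seed: Optional[int] = None
) -> List[T]:
    """Draw up to k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration (Algorithm R); it makes a single
    pass and keeps only k items in memory.
    """
    if k < 0:
        raise ValueError("k must be non-negative")

    rng = random.Random(seed)
    reservoir: List[T] = []

    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i + 1 is kept with probability k / (i + 1).
            # randint is inclusive on both ends, so j has i + 1 possible values.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item

    return reservoir


# Usage: sample 10 items from a stream without materializing it first.
sample = reservoir_sample(range(1_000_000), k=10, seed=0)
```

If the stream holds fewer than `k` items, the function returns all of them in their original order.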
For details on how this dataset was used in optimal pre-training data composition research, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).
## Citation
If you use this dataset, please cite:
```bibtex
@article{sharma2025billion,
title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
author={Sharma, Asankhaya},
year={2025},
url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
For more details, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).
Provider:
maas
Created:
2025-10-22



