fineweb-edu-10M
Source: ModelScope community (魔搭社区) · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/codelion/fineweb-edu-10M
Description:
## Sampling Methodology
This dataset was created using **reservoir sampling**, a single-pass random sampling algorithm that gives every sample in the source dataset an equal probability of being included. Because the selection is unbiased, the 10M-token sample is expected to be representative of the full dataset's characteristics.
**Source Dataset**: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
**Sample Size**: 10M tokens
**Content**: Curated educational web resources
Working with this sample enables rapid experimentation and ablation studies without processing the entire source dataset, while the uniform sampling keeps results statistically representative of it.
For details on how this dataset was used in optimal pre-training data composition research, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).
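To make the sampling idea concrete, here is a minimal sketch of classic reservoir sampling (often called Algorithm R) in Python. It is an illustration only, not the author's actual pipeline: the function name `reservoir_sample`, the fixed item count `k`, and the `seed` parameter are assumptions made for this example, and the card does not say how the 10M-token budget was enforced.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Pick k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration; it samples a fixed number of
    items, not a fixed number of tokens.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1);
            # if kept, it evicts a uniformly chosen current member.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

A call such as `reservoir_sample(documents, k=1000, seed=42)` makes one pass over `documents` and holds at most `k` items in memory, which is why the technique suits corpora too large to load at once. If the stream has fewer than `k` items, all of them are returned.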
## Citation
If you use this model/dataset, please cite:
```bibtex
@article{sharma2025billion,
title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
author={Sharma, Asankhaya},
year={2025},
url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
For more details, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).
Provider:
maas
Created:
2025-10-22



