fineweb-edu-10M

Name: fineweb-edu-10M
Creator: maas
Published: 2025-12-05 16:55:13
License: not specified

ModelScope community; updated 2025-12-05, indexed 2025-12-06

Download link:

https://modelscope.cn/datasets/codelion/fineweb-edu-10M

Description:

## Sampling Methodology

This dataset was created using **reservoir sampling**, a statistically unbiased random sampling algorithm that guarantees each sample from the source dataset has an equal probability of being included. This ensures the 10M token sample is representative of the full dataset's characteristics.

- **Source Dataset**: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)
- **Sample Size**: 10M tokens
- **Content**: Curated educational web resources

Reservoir sampling enables rapid experimentation and ablation studies without processing the entire source dataset, while maintaining statistical validity of results.

For details on how this dataset was used in optimal pre-training data composition research, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).

## Citation

If you use this model/dataset, please cite:

```bibtex
@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```

For more details, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).
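
As a rough illustration of the reservoir sampling named under Sampling Methodology, the sketch below implements the textbook Algorithm R over an arbitrary stream. It is not the pipeline used to build this dataset: the function name `reservoir_sample`, the `seed` parameter, and the choice to sample a fixed number of items (rather than filling a 10M token budget) are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Draw up to k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration only (Algorithm R); the dataset's
    actual sampling code is not shown in this card.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an existing one
            # with probability k / (i + 1), which keeps inclusion uniform.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Stand-in for a stream of documents; real input would be dataset rows.
    docs = (f"doc-{n}" for n in range(100_000))
    print(reservoir_sample(docs, k=5, seed=42))
```

The stream is read once and only `k` items are held in memory, which is why this approach suits sources too large to load or shuffle in full.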

Provider:

maas

Created:

2025-10-22
