
fineweb-edu-100M

ModelScope Community. Updated: 2025-12-05. Indexed: 2025-12-06.
Download link:
https://modelscope.cn/datasets/codelion/fineweb-edu-100M
Description:
## Sampling Methodology

This dataset was created using **reservoir sampling**, a statistically unbiased random sampling algorithm that gives each sample in the source dataset an equal probability of being included. This makes the 100M token sample representative of the full dataset's characteristics.

**Source Dataset**: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)

**Sample Size**: 100M tokens

**Content**: Curated educational web resources

The resulting sample enables rapid experimentation and ablation studies without processing the entire source dataset in each run, while keeping results statistically representative of the source.

For details on how this dataset was used in optimal pre-training data composition research, see the [blog post](https://huggingface.co/blog/codelion/optimal-dataset-mixing/).

## Citation

If you use this dataset, please cite:

```bibtex
@article{sharma2025billion,
  title={The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix},
  author={Sharma, Asankhaya},
  year={2025},
  url={https://huggingface.co/blog/codelion/optimal-dataset-mixing/}
}
```
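## Reservoir Sampling Sketch

The reservoir sampling technique named under Sampling Methodology can be illustrated with a minimal sketch of the classic single-pass variant (often called Algorithm R). This is not the dataset author's actual pipeline, which the card does not describe. The function name `reservoir_sample`, the fixed item count `k`, and the `seed` parameter are assumptions made for illustration only.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return k items drawn uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration; it makes one pass over the stream
    and keeps at most k items in memory.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a reservoir entry
            # only if the slot falls inside the reservoir (probability k / (i + 1)).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: draw 5 items from a stream of 1,000 integers.
sample = reservoir_sample(range(1000), k=5, seed=42)
```

This sketch keeps a fixed number of items. A sample defined by a token budget, such as 100M tokens, would need an additional rule for deciding when the budget is reached; the card does not state how that was handled.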

Provider: maas
Created: 2025-10-22