MegaMath-Web-Pro-Max
Source: ModelScope community. Updated 2025-12-05; indexed 2025-07-05.
Download link:
https://modelscope.cn/datasets/OctoThinker/MegaMath-Web-Pro-Max
Resource description:
# [OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling](https://arxiv.org/abs/2506.20512)

## The Curation of MegaMath-Web-Pro-Max
**Step 1**: Uniformly and randomly sample millions of documents from the MegaMath-Web corpus, stratified by publication year;
**Step 2**: Annotate them using Llama-3.1-70B-instruct with a scoring prompt from FineMath and prepare the seed data;
**Step 3**: Train a fastText classifier on the seed data, with proper preprocessing;
**Step 4**: Filter documents with a score threshold of 0.4;
**Step 5**: Refine the filtered documents at scale using Llama-3.1-70B-instruct with a refinement prompt.
<!--  -->
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="mm_web_pro_max_data_pipeline.png" alt="Data Pipeline" style="width:70%;">
</div>
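
As a rough illustration of Steps 1 and 4, the sketch below draws a year-stratified uniform sample and then keeps documents whose classifier score clears the 0.4 threshold. It is not the authors' implementation: the `year` and `text` fields, the equal per-year quota, the `>=` comparison, and the `toy_score` function (a stand-in for the trained fastText classifier) are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_sample(docs, per_year, seed=0):
    """Uniformly sample up to `per_year` documents from each year bucket."""
    rng = random.Random(seed)
    by_year = defaultdict(list)
    for doc in docs:
        by_year[doc["year"]].append(doc)
    sample = []
    for year in sorted(by_year):
        bucket = by_year[year]
        # random.sample draws without replacement, uniformly within the bucket.
        sample.extend(rng.sample(bucket, min(per_year, len(bucket))))
    return sample


def filter_by_score(docs, score_fn, threshold=0.4):
    """Keep documents whose score is at least `threshold`."""
    return [doc for doc in docs if score_fn(doc["text"]) >= threshold]


def toy_score(text):
    """Hypothetical scorer standing in for the trained fastText classifier."""
    return 0.9 if "theorem" in text else 0.1


# Toy corpus: 30 documents spread evenly over three years.
docs = [
    {"year": 2019 + i % 3, "text": ("theorem " if i % 2 == 0 else "news ") + str(i)}
    for i in range(30)
]
seed_pool = stratified_sample(docs, per_year=4)
kept = filter_by_score(seed_pool, toy_score, threshold=0.4)
```

In a real pipeline, `score_fn` would wrap the classifier's predicted probability for the positive (math-quality) label, and the per-year quota would be chosen to reach the desired sample size.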
## Demonstration of Data Quality (from pre/mid-training side)
Following MegaMath-Web's yearly dump comparison setup (select the top 5B tokens from each year, continually pre-train TinyLlama on them, and report the average benchmark performance), we evaluate the quality of our recalled corpus under different thresholds, as shown in the figure below. Note that no data here is refined by an LLM; all documents are raw documents filtered from MegaMath.
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="web_data_quality_comparison_yearly.png" alt="data quality" style="width:60%;">
</div>
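
The per-year selection in this setup can be sketched as a greedy pick under a token budget. This is a minimal illustration only: it assumes that "top" means highest classifier score, and the `year`, `score`, and `n_tokens` fields are hypothetical names. The real budget would be 5B tokens per year; the toy budget here is tiny.

```python
def top_tokens_per_year(docs, budget):
    """For each year, take the highest-scoring documents until `budget` tokens are reached."""
    by_year = {}
    for doc in docs:
        by_year.setdefault(doc["year"], []).append(doc)

    selected = {}
    for year, bucket in by_year.items():
        picked, used = [], 0
        # Walk documents from highest to lowest score.
        for doc in sorted(bucket, key=lambda d: d["score"], reverse=True):
            if used >= budget:
                break
            picked.append(doc)
            used += doc["n_tokens"]
        selected[year] = picked
    return selected


toy_docs = [
    {"year": 2020, "score": 0.9, "n_tokens": 3},
    {"year": 2020, "score": 0.5, "n_tokens": 3},
    {"year": 2020, "score": 0.7, "n_tokens": 3},
    {"year": 2021, "score": 0.2, "n_tokens": 10},
]
subset = top_tokens_per_year(toy_docs, budget=5)
```

The last document picked may overshoot the budget slightly. Truncating it instead is an equally reasonable choice that the source does not specify.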
## Demonstration of Data Quality (from RL side)
Mid-training on math web data improves performance over the base model, with MegaMath-Web-Pro and MegaMath-Web-Pro-Max showing slightly better gains than Finemath-4plus. After RL training, we find that mid-training on math web corpora improves RL performance.
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="data_quality_rl_side.png" alt="data quality" style="width:80%;">
</div>
## Citation
Check out our [paper](https://arxiv.org/abs/2506.20512) for more details. If you use our dataset or find our work useful, please cite:
```
@article{wang2025octothinker,
title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling},
author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},
year={2025},
journal={arXiv preprint arXiv:2506.20512},
note={Preprint}
}
```
Provider: maas
Created: 2025-07-04



