MegaMath
Source: ModelScope community · Updated 2026-01-06 · Indexed 2025-11-03
Download link:
https://modelscope.cn/datasets/LLM360/MegaMath
Description:
# MegaMath: Pushing the Limits of Open Math Corpora
> MegaMath is part of TxT360, curated by LLM360 Team.
<center><img src="teasor.png" alt="MegaMath Collection" /></center>
We introduce MegaMath, an open math pretraining dataset curated from diverse, math-focused sources, with over 300B tokens.
MegaMath is curated via the following three efforts:
- **Revisiting web data**:
We re-extracted mathematical documents from Common Crawl using math-oriented HTML optimizations, fastText-based filtering, and deduplication to obtain higher-quality web data.
- **Recalling Math-related code data**:
We identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity.
- **Exploring Synthetic data**:
We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data.
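The filter-then-deduplicate idea behind the web data effort can be illustrated with a small sketch. This is not the MegaMath pipeline: `math_score` is a hypothetical keyword heuristic standing in for a trained fastText classifier, exact hashing of normalized text stands in for whatever deduplication method was actually used, and the `threshold` value is an arbitrary assumption.

```python
import hashlib
import re

# Hypothetical stand-in for a trained fastText classifier: flags tokens that
# look mathematical. A real pipeline would use a learned model instead.
MATH_HINTS = re.compile(
    r"(\\frac|\\sum|\\int|=|\^|\btheorem\b|\bproof\b|\bequation\b|\d)"
)


def math_score(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look math-like."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if MATH_HINTS.search(tok.lower()))
    return hits / len(tokens)


def fingerprint(text: str) -> str:
    """Hash the text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_and_dedup(docs, threshold=0.2):
    """Keep documents scoring at least `threshold`, dropping exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        if math_score(doc) < threshold:
            continue  # filtered out as non-mathematical
        key = fingerprint(doc)
        if key in seen:
            continue  # duplicate of a document already kept
        seen.add(key)
        kept.append(doc)
    return kept


docs = [
    "Theorem: for all x, x^2 >= 0. Proof: trivial.",
    "theorem:  for all x, x^2 >= 0.   proof: trivial.",
    "Our store offers great deals on shoes and bags today.",
]
kept = filter_and_dedup(docs)
```

In this toy run only the first document survives: the second collapses onto the same fingerprint after normalization, and the third scores below the threshold.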
## MegaMath Compared to Existing Datasets
MegaMath is the largest open math pre-training dataset to date, surpassing DeepSeekMath (120B tokens).
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f6e5ab90dde28ef57d293/lFa_r4gSXhjwep7XAwwQj.png" width="75%" />
</div>
## MegaMath Delivers with High Quality
During development, we ran extensive experiments to find the best practices for text extraction, deduplication, fastText training, and more. Models trained on MegaMath data perform better than those trained on existing open datasets.
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f6e5ab90dde28ef57d293/-E1tZP-vbU1ZPzy56cl4s.png" width="30%" />
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f6e5ab90dde28ef57d293/XSBJ_wVexM-0rk9bcpU5Q.png" width="30%" />
</div>
## Training MegaMath on Latest LMs
We also release two proof-of-concept models based on [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) and [Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B).
Training Llama-3.2-1B and Llama-3.2-3B on MegaMath brings about a 15% to 20% performance boost on 10 downstream benchmarks, demonstrating its high data quality.
<div style="display: flex; justify-content: center; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f6e5ab90dde28ef57d293/EIReQ8TIbyn1V3JfsEKiL.png" width="50%" />
</div>
## Detailed Statistics
| **Category** | **# Samples (M)** | **# Tokens (B)** | **Avg. # Tokens** |
|------------------------|----------------:|--------------:|------------------:|
| **Web Domain** | **121.5** | **279.0** | **2296.9** |
| Web | 106.5 | 263.9 | 2478.7 |
| Web-Pro | 15.0 | 15.1 | 1006.0 |
| **Code Domain** | **13.4** | **28.1** | **2102.7** |
| **Synthetic Data** | **80.2** | **64.5** | **804.5** |
| Translated Code | 7.4 | 7.2 | 979.5 |
| Q&A | 22.6 | 7.0 | 308.3 |
| Text&Code Block | 50.2 | 50.3 | 1002.1 |
| **Total** | **215.1** | **371.6** | **1727.6** |
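As a sanity check on the table, the subtotals and the overall average can be recomputed from the leaf rows. The sketch below copies the sample and token counts from the table; the `rows` dictionary and the `avg_tokens` helper are illustrative names, not part of any MegaMath tooling. Because the table's figures are rounded, recomputed per-row averages only approximate the listed ones.

```python
# (samples in millions, tokens in billions), copied from the table's leaf rows
rows = {
    "Web": (106.5, 263.9),
    "Web-Pro": (15.0, 15.1),
    "Code": (13.4, 28.1),
    "Translated Code": (7.4, 7.2),
    "Q&A": (22.6, 7.0),
    "Text&Code Block": (50.2, 50.3),
}


def avg_tokens(samples_m: float, tokens_b: float) -> float:
    """Average tokens per sample, given samples in millions and tokens in billions."""
    return tokens_b * 1e9 / (samples_m * 1e6)


# Totals across all leaf rows, rounded to the table's one-decimal precision.
total_samples = round(sum(s for s, _ in rows.values()), 1)
total_tokens = round(sum(t for _, t in rows.values()), 1)
overall_avg = avg_tokens(total_samples, total_tokens)

print(f"total samples (M): {total_samples}")
print(f"total tokens (B):  {total_tokens}")
print(f"overall avg tokens per sample: {overall_avg:.1f}")
for name, (s, t) in rows.items():
    print(f"{name:>16}: {avg_tokens(s, t):8.1f}")
```

The totals match the **Total** row (215.1M samples, 371.6B tokens), and the overall average comes out at roughly 1727.6 tokens per sample. Per-row averages differ slightly from the table, presumably because the table was computed from unrounded counts.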
## Citation
If you use our dataset or find our work useful, please cite:
```bibtex
@article{zhou2025megamath,
title = {MegaMath: Pushing the Limits of Open Math Corpora},
author = {Zhou, Fan and Wang, Zengzhi and Ranjan, Nikhil and Cheng, Zhoujun and Tang, Liping and He, Guowei and Liu, Zhengzhong and Xing, Eric P.},
journal = {arXiv preprint arXiv:2504.02807},
year = {2025},
note = {Preprint}
}
```
Provider: maas
Created: 2025-09-26



