MathCode-Pile
Source: arXiv (indexed 2025-09-30)
Download link:
https://github.com/mathllm/MathCoder2
Description:
This is a high-quality dataset for continued mathematical pretraining, built from math-related web data, code that uses mathematical packages, mathematics textbooks, and synthetic data. It aims to improve the mathematical reasoning capabilities of large language models (LLMs). The dataset pairs natural-language reasoning steps with their corresponding code, and it is open-sourced to ensure transparency and reproducibility. It contains 19.2 billion tokens and targets mathematical reasoning and code generation.



