MathX-5M
收藏魔搭社区2026-05-12 更新2025-07-12 收录
下载链接:
https://modelscope.cn/datasets/XenArcAI/MathX-5M
下载链接
链接失效反馈官方服务:
资源简介:
# Modotte
---
<p align="center">
<img
src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/lmv_L59KWnn0lQjpogWKx.png"
alt="MathX-5M Banner"
width="80%"
style="border-radius:15px;"
/>
</p>
> Note : This datset is the part of a lineup MathX by Modotte you can get a lots of datasets on this same linup main focus is to provide very high quality datasets for model training
> and finetuning
This dataset is curated from high-quality public sources and enhanced with synthetic data from both closed and open-source models. It serves as a strong foundation for instruction-based model tuning and fine-tuning, offering one of the most refined and extensive corpora available.
### Important:
- Num of rows with Think: 1,178,145
- Num of rows with No Think: 3,871,855
### Key Features
- **Scale**: 5 million examples of highly curated step-by-step thinking data
- **Diversity**: Comprehensive coverage of mathematical domains from basic arithmetic to advanced calculus
- **Quality**: Multi-stage filtering and verification processes
- **Reasoning**: Step-by-step solutions with detailed mathematical thinking
- **Accuracy**: RL-verified answers with correctness validation
## Dataset Overview
**MathX** is a meticulously curated mathematical reasoning dataset designed specifically for instruction-based model tuning and fine-tuning of existing models with enhanced thinking capabilities. This represents the largest and most comprehensively filtered corpus of publicly available mathematical reasoning data.
## How to use?
```python
pip install -U datasets fsspec
```
```python
from datasets import load_dataset
dataset = load_dataset("Modotte/MathX-5M")
```
### Key Features
- **Scale**: 5 million examples of highly curated step-by-step thinking data
- **Diversity**: Comprehensive coverage of mathematical domains from basic arithmetic to advanced calculus
- **Quality**: Multi-stage filtering and verification processes
- **Reasoning**: Step-by-step solutions with detailed mathematical thinking
- **Accuracy**: RL-verified answers with correctness validation
## Data Curation Process
This dataset has been carefully constructed through a multi-source approach:
### Data Sources
- **High-Quality Existing Datasets**: Curated from multiple premium mathematical datasets available online(Nvidia, Openr1, Modotte)
- **Synthetic Generation**: Generated using both closed-source and open-source language models(Modotte)
- **Expert Validation**: Human-verified mathematical solutions and explanations(Modotte)
### Filtering Pipeline
Our rigorous filtering process includes:
1. **Deduplication**: Removal of duplicate problems and solutions
2. **Normalization**: Lowercasing and text standardization
3. **Stopword Processing**: Intelligent removal of non-essential words
4. **Quality Scoring**: Multi-dimensional quality assessment
5. **Answer Verification**: Reinforcement Learning-based answer validation
6. **Content Filtering**: Removal of inappropriate or incorrect content
### Problem Complexity Distribution
- **Basic Level** (30%): Fundamental mathematical concepts and operations
- **Intermediate Level** (30%): Multi-step problems requiring reasoning chains
- **Advanced Level** (40%): Complex mathematical challenges and proofs
### Mathematical Domains Covered
- Arithmetic and Number Theory, Algebra and Polynomial Mathematics, eometry and Trigonometry, Calculus and Analysis
>Note : domains are for reference only the actual data is very diverse and covers more domains than stated actual data have more complex and high level questions than stated.
## Use cases
- **Fine-tuning** mathematical reasoning capabilities in language models
- **Training** instruction-following models with mathematical focus
- **Benchmarking** model performance on mathematical reasoning tasks
- **Research** in mathematical AI and automated theorem proving
- **Educational** applications requiring step-by-step mathematical explanations
## Dataset Format
Each example contains:
- **Problem Statement**: Clear mathematical question or challenge
- **Step-by-Step Solution**: Detailed reasoning process
- **Final Answer**: Verified correct solution
## Quality Assurance
- **RL Verification**: All answers verified using reinforcement learning techniques
- **Correctness Guarantee**: Only problems with verified correct answers are included
- **Human Review**: Sample validation by mathematical experts
- **Automated Checks**: Computational verification where applicable
## Performance Metrics
Models trained on this dataset show significant improvements in:
- Mathematical reasoning accuracy
- Step-by-step explanation quality
- Problem-solving methodology
- Cross-domain mathematical transfer
## Acknowledgments
Special thanks to our partners and contributors:
- **NVIDIA** - Reference datsets and also the MathX contains many examples taken from Nvidia's existing datasets
- **Openr1** - Reference datsets and also the MathX contains many examples taken from Openr1's existing datasets
- **Modotte Team** - Dataset curation and quality assurance alsong with some extra currated examples
## Citation
**Anyone** can freely use and modify this dataset
## License
This dataset is released under [ Apache-2.0 - Lisence ].
```bibtex
@dataset{mathx2024,
title={MathX: Large-Scale Mathematical Reasoning Dataset},
author={Parvesh Rawal at Modotte},
year={2024},
publisher={Modotte},
url={https://huggingface.co/datasets/Modotte/MathX-5M}
}
```
## Contact
For questions, suggestions, or collaboration opportunities:
- **Email**: [Modotte](team@Modotte.com)
- **Twitter**: [@Modotte]
- **GitHub**: [Modotte]
---
*Built with ❤️ by Modotte - Advancing AI through high-quality data*
# Modotte
---
<p align="center">
<img
src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/lmv_L59KWnn0lQjpogWKx.png"
alt="MathX-5M 宣传横幅"
width="80%"
style="border-radius:15px;"
/>
</p>
> 注:本数据集隶属于Modotte打造的MathX系列数据集,该系列涵盖多款同类数据集,核心宗旨是为模型训练与微调提供高品质数据集资源。
本数据集从优质公开数据源中精选整理,并结合闭源与开源大语言模型(Large Language Model)生成的合成数据进行增强,可作为基于指令的模型微调与适配的坚实基础,是目前已公开的最精良、最全面的数学推理语料库之一。
### 重要说明:
- 带思考过程(Think:)的条目数:1,178,145
- 无思考过程(No Think:)的条目数:3,871,855
### 核心特性
- **规模**:包含500万条经过精心筛选的分步思考推理样本
- **多样性**:全面覆盖从基础算术到高等微积分的各类数学领域
- **质量**:经过多阶段过滤与验证流程
- **推理能力**:附带详细数学思考过程的分步解题方案
- **准确性**:经过强化学习(Reinforcement Learning)验证的标准答案与正确性校验
## 数据集概览
**MathX**是一款经过精心整理的数学推理数据集,专为基于指令的模型微调以及提升模型的思考能力而设计,是目前公开的规模最大、筛选最全面的数学推理语料库。
## 使用方法
python
pip install -U datasets fsspec
python
from datasets import load_dataset
dataset = load_dataset("Modotte/MathX-5M")
### 核心特性
- **规模**:包含500万条经过精心筛选的分步思考推理样本
- **多样性**:全面覆盖从基础算术到高等微积分的各类数学领域
- **质量**:经过多阶段过滤与验证流程
- **推理能力**:附带详细数学思考过程的分步解题方案
- **准确性**:经过强化学习验证的标准答案与正确性校验
## 数据整理流程
本数据集通过多源采集的方式精心构建:
### 数据来源
- **优质现有数据集**:从网络上的多款优质数学数据集精选整理而来(包括NVIDIA、Openr1、Modotte)
- **合成数据生成**:使用闭源与开源大语言模型生成合成数据
- **专家验证**:由人工核验数学解题过程与解释说明(Modotte团队)
### 过滤流程
我们的严格过滤流程包括:
1. **去重**:移除重复的题目与解题方案
2. **标准化处理**:统一小写格式与文本规范
3. **停用词处理**:智能移除非必要词汇
4. **质量评分**:多维度的质量评估
5. **答案验证**:基于强化学习的答案有效性校验
6. **内容过滤**:移除不当或错误内容
### 题目复杂度分布
- **基础层级(30%)**:基础数学概念与运算
- **中级层级(30%)**:需要推理链的多步解题问题
- **高级层级(40%)**:复杂数学挑战与证明题
### 覆盖数学领域
- 算术与数论、代数与多项式数学、几何与三角学、微积分与分析学
> 注:上述领域仅作参考,实际数据覆盖范围更广,包含更多未列明的领域,且实际题目比描述的更为复杂、高阶。
## 应用场景
- **微调**大语言模型的数学推理能力
- **训练**聚焦数学任务的指令遵循模型
- **基准测试**模型在数学推理任务中的性能表现
- **开展**数学人工智能与自动定理证明领域的研究
- **应用于**需要分步数学解释的教育场景
## 数据集格式
每条样本包含以下内容:
- **问题描述**:清晰的数学问题或挑战
- **分步解题过程**:详细的推理流程
- **最终答案**:经过验证的正确解
## 质量保障
- **强化学习验证**:所有答案均通过强化学习技术完成校验
- **正确性保障**:仅收录经过验证的正确答案对应的题目
- **人工复核**:由数学专家对样本进行抽样验证
- **自动化校验**:对可计算题目进行计算验证
## 性能指标
基于本数据集训练的模型在以下方面展现出显著提升:
- 数学推理准确率
- 分步解释质量
- 问题解决方法论
- 跨领域数学迁移能力
## 致谢
特别感谢以下合作伙伴与贡献者:
- **NVIDIA**:提供参考数据集,本数据集包含大量源自NVIDIA现有数据集的样本
- **Openr1**:提供参考数据集,本数据集包含大量源自Openr1现有数据集的样本
- **Modotte团队**:负责数据集整理与质量保障,并提供部分精选样本
## 引用说明
任何个人或机构均可自由使用与修改本数据集
## 开源许可
本数据集采用 [Apache-2.0 许可协议] 发布。
bibtex
@dataset{mathx2024,
title={MathX: Large-Scale Mathematical Reasoning Dataset},
author={Parvesh Rawal at Modotte},
year={2024},
publisher={Modotte},
url={https://huggingface.co/datasets/Modotte/MathX-5M}
}
## 联系方式
如有疑问、建议或合作意向,请通过以下方式联系:
- **邮箱**:[Modotte团队](team@Modotte.com)
- **Twitter**:[@Modotte]
- **GitHub**:[Modotte]
---
*由Modotte团队倾心打造——通过高品质数据推动人工智能进步 ❤️*
提供机构:
maas
创建时间:
2025-07-07
搜集汇总
数据集介绍

背景与挑战
背景概述
MathX-5M是一个包含500万高质量数学推理示例的数据集,覆盖广泛的数学领域,并通过严格的多阶段验证流程确保数据质量,适用于模型微调和数学AI研究。
以上内容由遇见数据集搜集并总结生成



