
MathX-5M

ModelScope Community · Updated 2026-05-12 · Indexed 2025-07-12
Download link:
https://modelscope.cn/datasets/XenArcAI/MathX-5M
Resource description:
# Modotte

---

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/677fcdf29b9a9863eba3f29f/lmv_L59KWnn0lQjpogWKx.png" alt="MathX-5M Banner" width="80%" style="border-radius:15px;" />
</p>

> Note: This dataset is part of the MathX lineup by Modotte, which includes many other datasets. The lineup's main focus is to provide very high-quality datasets for model training and fine-tuning.

This dataset is curated from high-quality public sources and enhanced with synthetic data from both closed and open-source models. It serves as a strong foundation for instruction-based model tuning and fine-tuning, offering one of the most refined and extensive corpora available.

### Important:

- Number of rows with Think: 1,178,145
- Number of rows with No Think: 3,871,855

### Key Features

- **Scale**: 5 million examples of highly curated step-by-step thinking data
- **Diversity**: Comprehensive coverage of mathematical domains from basic arithmetic to advanced calculus
- **Quality**: Multi-stage filtering and verification processes
- **Reasoning**: Step-by-step solutions with detailed mathematical thinking
- **Accuracy**: RL-verified answers with correctness validation

## Dataset Overview

**MathX** is a meticulously curated mathematical reasoning dataset designed specifically for instruction-based model tuning and fine-tuning of existing models with enhanced thinking capabilities. Modotte describes it as the largest and most comprehensively filtered corpus of publicly available mathematical reasoning data.

## How to use?

```bash
pip install -U datasets fsspec
```

```python
from datasets import load_dataset

dataset = load_dataset("Modotte/MathX-5M")
```

## Data Curation Process

This dataset has been carefully constructed through a multi-source approach:

### Data Sources

- **High-Quality Existing Datasets**: Curated from multiple premium mathematical datasets available online (Nvidia, Openr1, Modotte)
- **Synthetic Generation**: Generated using both closed-source and open-source language models (Modotte)
- **Expert Validation**: Human-verified mathematical solutions and explanations (Modotte)

### Filtering Pipeline

Our rigorous filtering process includes:

1. **Deduplication**: Removal of duplicate problems and solutions
2. **Normalization**: Lowercasing and text standardization
3. **Stopword Processing**: Intelligent removal of non-essential words
4. **Quality Scoring**: Multi-dimensional quality assessment
5. **Answer Verification**: Reinforcement Learning-based answer validation
6. **Content Filtering**: Removal of inappropriate or incorrect content

### Problem Complexity Distribution

- **Basic Level** (30%): Fundamental mathematical concepts and operations
- **Intermediate Level** (30%): Multi-step problems requiring reasoning chains
- **Advanced Level** (40%): Complex mathematical challenges and proofs

### Mathematical Domains Covered

- Arithmetic and Number Theory
- Algebra and Polynomial Mathematics
- Geometry and Trigonometry
- Calculus and Analysis

> Note: These domains are for reference only. The actual data is very diverse, covers more domains than listed, and contains more complex and higher-level questions than stated.

## Use cases

- **Fine-tuning** mathematical reasoning capabilities in language models
- **Training** instruction-following models with mathematical focus
- **Benchmarking** model performance on mathematical reasoning tasks
- **Research** in mathematical AI and automated theorem proving
- **Educational** applications requiring step-by-step mathematical explanations

## Dataset Format

Each example contains:

- **Problem Statement**: Clear mathematical question or challenge
- **Step-by-Step Solution**: Detailed reasoning process
- **Final Answer**: Verified correct solution

## Quality Assurance

- **RL Verification**: All answers verified using reinforcement learning techniques
- **Correctness Guarantee**: Only problems with verified correct answers are included
- **Human Review**: Sample validation by mathematical experts
- **Automated Checks**: Computational verification where applicable

## Performance Metrics

Models trained on this dataset show significant improvements in:

- Mathematical reasoning accuracy
- Step-by-step explanation quality
- Problem-solving methodology
- Cross-domain mathematical transfer

## Acknowledgments

Special thanks to our partners and contributors:

- **NVIDIA** - Reference datasets; MathX also contains many examples taken from NVIDIA's existing datasets
- **Openr1** - Reference datasets; MathX also contains many examples taken from Openr1's existing datasets
- **Modotte Team** - Dataset curation and quality assurance, along with some extra curated examples

## Citation

**Anyone** can freely use and modify this dataset. To cite it:

```bibtex
@dataset{mathx2024,
  title={MathX: Large-Scale Mathematical Reasoning Dataset},
  author={Parvesh Rawal at Modotte},
  year={2024},
  publisher={Modotte},
  url={https://huggingface.co/datasets/Modotte/MathX-5M}
}
```

## License

This dataset is released under the Apache-2.0 license.

## Contact

For questions, suggestions, or collaboration opportunities:

- **Email**: [team@Modotte.com](mailto:team@Modotte.com)
- **Twitter**: [@Modotte]
- **GitHub**: [Modotte]

---

*Built with ❤️ by Modotte - Advancing AI through high-quality data*
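The first two Filtering Pipeline steps (deduplication and normalization) can be illustrated with a minimal sketch. This is not Modotte's actual pipeline, whose implementation the card does not describe. The field names `problem` and `answer`, the helper names `normalize` and `deduplicate`, and the choice of Unicode NFKC plus whitespace collapsing as "text standardization" are all assumptions made for illustration.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, apply Unicode NFKC, and collapse whitespace runs.

    Assumption: this stands in for the card's "lowercasing and text
    standardization"; the real rules are not published.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(examples: list[dict], key: str = "problem") -> list[dict]:
    """Keep the first example seen for each normalized value of `key`.

    `key` defaults to a hypothetical "problem" field.
    """
    seen: set[str] = set()
    unique = []
    for example in examples:
        fingerprint = normalize(example[key])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(example)
    return unique


# Toy records; the first two differ only in case and spacing.
sample = [
    {"problem": "What is 2 + 2?", "answer": "4"},
    {"problem": "what is  2 + 2?", "answer": "4"},
    {"problem": "Solve x^2 = 9.", "answer": "x = 3 or x = -3"},
]
cleaned = deduplicate(sample)
print(len(cleaned))  # 2
```

Normalizing before comparing means near-identical problems collapse to one fingerprint, while the original, un-normalized text is what gets kept in the output.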

Provider:
maas
Created:
2025-07-07
Collected summary
Dataset introduction
Background and challenges
Background overview
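MathX-5M is a dataset of 5 million high-quality mathematical reasoning examples. It covers a wide range of mathematical domains, uses a rigorous multi-stage verification process to ensure data quality, and is suited to model fine-tuning and mathematical AI research.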
The content above was collected and summarized by 遇见数据集.