naimulislam/reasoning-math-advanced-1m

Name: naimulislam/reasoning-math-advanced-1m
Creator: naimulislam
Published: 2025-12-20 17:47:33
License: 暂无描述

Hugging Face2025-12-20 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/naimulislam/reasoning-math-advanced-1m

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: mit task_categories: - text-generation - question-answering language: - en tags: - reasoning - chain-of-thought - synthetic - math - logic - cot size_categories: - 1M<n<10M pretty_name: Reasoning Math Advanced 1M --- # 🧠 Reasoning Math Advanced 1M ## 📖 Dataset Summary Reasoning Math Advanced 1M is a large-scale, synthetic dataset designed to enhance the reasoning capabilities of Large Language Models (LLMs). Comprising 1,000,000 unique samples, this dataset focuses on Math, Logic, and Common Sense reasoning tasks. A unique feature of this dataset is its adaptive reasoning structure, where the presence of Chain-of-Thought (CoT) reasoning scales with difficulty. All reasoning traces are encapsulated within specific `<thinking>` tags to facilitate distinct internal monologue training. ## ⚙️ Dataset Structure Each data point contains the following fields: | Field | Type | Description | | :--- | :--- | :--- | | `serial_number` | int | Unique identifier for the sample. | | `difficulty` | str | The complexity level: Easy, Medium, or Hard. | | `question` | str | The input query or problem statement. | | `reasoning` | str | The internal thought process (CoT). Wrapped in `<thinking>...</thinking>`. | | `final_answer` | str | The concise final conclusion or solution. | ### Data Instance Example ```json { "serial_number": 4052, "difficulty": "Hard", "question": "Solve for x and y: 1) 3x + 2y = 12, 2) 5x - y = 7", "reasoning": "<thinking>Step 1: Multiply Eq 2 by 2 to align y coefficients: 10x - 2y = 14.\nStep 2: Add modified Eq 2 to Eq 1: (3x + 10x) + (2y - 2y) = 12 + 14 -> 13x = 26.\nStep 3: Solve for x: x = 2.\nStep 4: Substitute x back into Eq 2: 5(2) - y = 7 -> 10 - 7 = y -> y = 3.</thinking>", "final_answer": "x=2, y=3" } ``` ## 🧠 Difficulty & Reasoning Logic The dataset is engineered to teach models when to think, not just how to think. The reasoning field behavior is strictly determined by the difficulty classification: | Difficulty | Reasoning Presence | Description | | :--- | :--- | :--- | | **Easy** | 0% (Empty) | Direct Q&A. The model learns to answer simple queries immediately without over-thinking. | | **Medium** | 50% | Mixed behavior. The model learns that intermediate difficulty sometimes requires thought, sometimes does not. | | **Hard** | 100% | Full CoT. The reasoning field is always populated and wrapped in `<thinking>` tags. | ## 💻 How to Use You can load this dataset directly using the Hugging Face datasets library: ```python from datasets import load_dataset # Replace with your actual repo name dataset = load_dataset("naimulislam/reasoning-math-advanced-1m") # Inspect a sample print(dataset['train'][0]) ``` ## Training Prompt Format To utilize the `<thinking>` tags effectively during fine-tuning, we recommend a prompt format that encourages the model to generate the opening tag. **Input Template:** ```text Question: {question} Answer: ``` **Target Output (Hard):** ```text <thinking> ...reasoning steps... </thinking> {final_answer} ``` ## 🛠️ Dataset Creation This dataset was synthetically generated using a specialized logic engine that creates diverse problem sets across: * **Linear Algebra** (Systems of equations) * **Polynomial Expansion** (FOIL method) * **Logic Puzzles** (Transitive properties, Truth tables) * **Arithmetic Sequences** * **Probability Theory** ## 📜 License This dataset is released under the MIT License. ## 🤝 Citation If you use this dataset in your research, please credit: ```bibtex @dataset{massive_reasoning_1m, author = Naimul Islam Nahid, title = Reasoning Math Advanced 1M, year = 2025, publisher = Hugging Face, journal = naimulislam/reasoning-math-advanced-1m, } ```

--- license: MIT许可证 task_categories: - 文本生成 - 问答 language: - 英语 tags: - 推理 - 思维链（Chain-of-Thought，CoT） - 合成 - 数学 - 逻辑 - CoT size_categories: - 100万<样本数<1000万 pretty_name: Reasoning Math Advanced 1M --- ## 🧠 推理数学进阶1M ## 📖 数据集概述推理数学进阶1M是一款大规模合成数据集，旨在提升大语言模型（Large Language Model，LLM）的推理能力。该数据集包含100万个独特样本，专注于数学、逻辑与常识推理任务。本数据集的独特之处在于其自适应推理结构：思维链（Chain-of-Thought，CoT）推理的出现比例随样本难度动态调整。所有推理轨迹均被包裹在专用的`<thinking>`标签中，以便开展差异化的内部独白训练。 ## ⚙️ 数据集结构每个数据点包含以下字段： | 字段名 | 数据类型 | 字段说明 | | :--- | :--- | :--- | | `serial_number` | int | 样本唯一标识符。 | | `difficulty` | str | 复杂度等级，分为简单（Easy）、中等（Medium）与困难（Hard）三级。 | | `question` | str | 输入查询或问题描述。 | | `reasoning` | str | 内部思维过程（即思维链推理），被包裹在`<thinking>...</thinking>`标签内。 | | `final_answer` | str | 简洁的最终结论或解决方案。 | ### 数据实例示例 json { "serial_number": 4052, "difficulty": "Hard", "question": "Solve for x and y: 1) 3x + 2y = 12, 2) 5x - y = 7", "reasoning": "<thinking>Step 1: Multiply Eq 2 by 2 to align y coefficients: 10x - 2y = 14. Step 2: Add modified Eq 2 to Eq 1: (3x + 10x) + (2y - 2y) = 12 + 14 -> 13x = 26. Step 3: Solve for x: x = 2. Step 4: Substitute x back into Eq 2: 5(2) - y = 7 -> 10 - 7 = y -> y = 3.</thinking>", "final_answer": "x=2, y=3" } ## 🧠 难度与推理逻辑本数据集的设计目标是教会模型“何时进行推理”，而非仅掌握“如何推理”。推理字段的内容严格由样本的难度等级决定： | 难度等级 | 推理内容占比 | 字段说明 | | :--- | :--- | :--- | | **简单（Easy）** | 0%（空内容） | 直接问答模式，模型可无需过度思考即可直接回答简单查询。 | | **中等（Medium）** | 50% | 混合模式，模型将学习到中等难度的问题有时需要推理，有时则无需。 | | **困难（Hard）** | 100% | 完整思维链模式，推理字段始终包含被`<thinking>`标签包裹的完整推理过程。 | ## 💻 使用方法您可直接通过Hugging Face数据集库加载该数据集： python from datasets import load_dataset # Replace with your actual repo name dataset = load_dataset("naimulislam/reasoning-math-advanced-1m") # Inspect a sample print(dataset['train'][0]) ## 训练提示模板为在微调阶段有效利用`<thinking>`标签，我们推荐采用可引导模型生成该起始标签的提示模板。 **输入模板：** text Question: {question} Answer: **目标输出（困难样本）：** text <thinking> ...reasoning steps... </thinking> {final_answer} ## 🛠️ 数据集构建本数据集通过专用逻辑引擎合成生成，涵盖以下多样题型： * **线性代数（Linear Algebra）**（方程组求解） * **多项式展开（Polynomial Expansion）**（FOIL法则） * **逻辑谜题（Logic Puzzles）**（传递性属性、真值表） * **算术序列（Arithmetic Sequences）** * **概率论（Probability Theory）** ## 📜 许可证本数据集采用MIT许可证发布。 ## 🤝 引用格式若您在研究中使用该数据集，请引用以下文献： bibtex @dataset{massive_reasoning_1m, author = Naimul Islam Nahid, title = Reasoning Math Advanced 1M, year = 2025, publisher = Hugging Face, journal = naimulislam/reasoning-math-advanced-1m, }

提供机构：

naimulislam

5,000+

优质数据集

54 个

任务类型

进入经典数据集