MathLake
Source: ModelScope community. Updated 2025-12-05, indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/OpenDataArena/MathLake
Resource description:
# MathLake: A Large-Scale Mathematics Dataset
MathLake is a collection of **8.3 million** mathematical problems aggregated from over **50 open-source datasets**. Unlike datasets that filter for the highest-quality solutions up front, MathLake prioritizes **query comprehensiveness**, serving as a universal "raw ore" that researchers can curate, distill, or annotate further. The dataset provides annotations for **Difficulty**, **Format**, and **Subject**, and covers mathematical fields ranging from basic arithmetic to advanced topics.
<div align="center">
<img src="conceptual.png" alt="mathlake" width="1200" />
</div>
## Motivation
MathLake aims to provide a comprehensive foundation for LLM researchers working on mathematical reasoning.
* **Query-Centric Focus**: We prioritize collecting massive queries from current datasets (over 50 sources), so researchers don't need to re-collect them.
* **Base for Distillation**: By focusing on the comprehensiveness of questions rather than just the perfection of existing answers, MathLake serves as an ideal starting point for **data curation and distillation**. Researchers can use these queries to generate new, higher-quality synthetic reasoning traces without worrying about data scarcity.
---
## Data Construction & Processing
To ensure MathLake serves as a robust foundation, we employed a rigorous multi-stage construction pipeline involving strict source selection and data cleaning.
### 1. Dataset Selection Criteria
We applied a rigorous screening process to select high-quality datasets updated after Jan 2023, prioritizing those with significant community impact, SFT-compatible formats, and balanced sizes to ensure a robust and modern training foundation.
### 2. Query Deduplication
Because MathLake aggregates over 50 sources, overlapping problems are inevitable. We performed **query-level deduplication** to consolidate redundant queries and keep the problem set unique and diverse.
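The card does not say how duplicates were detected. As a minimal sketch, assuming exact matching after light text normalization (the function names and normalization rules here are illustrative, not the authors' pipeline), query-level deduplication could look like this:

```python
import hashlib
import re
import unicodedata


def normalize_query(text: str) -> str:
    """Canonicalize a query so trivially different copies compare equal."""
    # Assumption: case and whitespace differences do not distinguish problems.
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each normalized `question`."""
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        digest = hashlib.sha256(
            normalize_query(record["question"]).encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Exact matching only catches near-verbatim copies; paraphrased duplicates would need fuzzy or embedding-based matching, which this sketch does not attempt.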
### 3. Query Cleaning
Since some aggregated datasets are general reasoning collections containing mixed domains (e.g., code, science), we applied strict filtering to ensure domain purity. We specifically retained only queries belonging to the **mathematics domain**. Additionally, to ensure language consistency, we filtered out non-English queries, preserving only those written in **English**.
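The card does not describe how the language filter was implemented. The following is a crude illustrative heuristic, not the authors' method: it keeps a query when most of its alphabetic characters are ASCII. The function name and the `threshold` value are assumptions, and the check only rejects non-Latin scripts, so it cannot tell English from, say, French.

```python
def looks_english(text: str, threshold: float = 0.9) -> bool:
    """Return True when at least `threshold` of the letters are ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # Purely symbolic queries such as "2 + 2 = ?" carry no language signal.
        return True
    ascii_letters = sum(ch.isascii() for ch in letters)
    return ascii_letters / len(letters) >= threshold
```

A production filter would more likely rely on a dedicated language-identification model; this sketch only illustrates where such a check sits in the cleaning step.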
### 4. Answer Extraction
We also perform **final answer extraction** on the original response to each problem. Answers are extracted by an LLM using a carefully designed prompt, and only valid answers are kept. Problems with truncated responses, ill-formed answers, or proof statements receive an empty answer. Because we do not verify the correctness of each response, further verification or selection is needed before using the data for RL training.
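Since empty answers mark problems without a usable final answer, a natural first step before RL training is to separate them out. The sketch below assumes an empty answer is stored as an empty string or `None`; the card does not specify the exact representation, and the helper names are hypothetical.

```python
def has_extracted_answer(record: dict) -> bool:
    """True when `extracted_answer` holds a non-blank string."""
    answer = record.get("extracted_answer")
    return isinstance(answer, str) and answer.strip() != ""


def split_by_answer(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition records into (with answer, without answer)."""
    with_answer = [r for r in records if has_extracted_answer(r)]
    without_answer = [r for r in records if not has_extracted_answer(r)]
    return with_answer, without_answer
```

Records in the first group still carry unverified answers, so a correctness check remains necessary before they are used as reward targets.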
---
## Dataset Composition
The dataset is composed of more than 50 different sources. It combines synthetic data, human-annotated competition problems, and variants of standard benchmarks. The top contributors include:
| Source | Count |
| :--- | :--- |
| ScaleQuest-Math | 1,001,915 |
| MathGradeSchool | 966,513 |
| Maths-College | 956,620 |
| NuminaMath-CoT | 801,161 |
| OpenMathInstruct-2 | 592,460 |
| MegaScience | 413,842 |
| GSM8K-Aug | 378,553 |
| NuminaMath1.5 | 323,307 |
| Orca-AgentInstruct-Shuffle | 302,264 |
| OpenMathReasoning-CoT | 278,784 |
| MathPlus | 246,755 |
| MagpieV2-250k-R1Llama70B | 235,652 |
| MathInstruct | 209,589 |
| MiroMind-M1-SFT-719K | 205,942 |
| SCP116K | 157,210 |
---
## Data Structure
Each record in the dataset is standardized to the following schema. We have unified differing column names (e.g., `prompt`, `instruction`, `problem`) into a single `question` field. Each entry contains:
```json
{
"id": "Unique identifier for the problem",
"source": "The original source dataset",
"question": "The mathematical problem statement",
"response": "The original solution or reasoning trace provided by the source dataset",
"extracted_answer": "The final answer extracted from the response (numeric or short text)",
"subject": "The mathematical field",
"format": "The format of the question",
"difficulty": "Estimated difficulty level of the problem"
}
```
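The column unification described above can be sketched as follows. The alias list comes from the examples in the text; the lookup order and the function name are assumptions made for illustration.

```python
# Source column names that are mapped onto the unified `question` field.
QUESTION_ALIASES = ("question", "prompt", "instruction", "problem")


def unify_question_field(raw: dict) -> dict:
    """Return a copy of `raw` whose problem text lives under `question`."""
    for name in QUESTION_ALIASES:
        if name in raw:
            # Drop every alias column, then store the text under one key.
            record = {k: v for k, v in raw.items() if k not in QUESTION_ALIASES}
            record["question"] = raw[name]
            return record
    raise KeyError(f"no question-like column among {QUESTION_ALIASES}")
```

For example, `unify_question_field({"prompt": "1+1?", "source": "demo"})` yields a record with `question` set to `"1+1?"` and the `source` column untouched.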
---
## Metadata Annotation
To make this massive dataset actionable, we developed a specialized LLM-based annotation pipeline. Each problem was processed by three distinct "Expert Personas" to generate consistent metadata.
> [!WARNING]
> **Annotation Method:** All tags were generated by LLMs using the specific taxonomies and decision rules outlined below. While we employed strong prompts to ensure high accuracy, edge cases may still exist.
### Statistics & Distributions
#### Subject Distribution
We employed an "Expert Curriculum Designer" persona to classify queries based purely on text content (without solving). The schema consists of **12 distinct subjects**:
* **Arithmetic** (Whole numbers, basic operations)
* **Pre-Algebra** (Ratios, order of operations)
* **Algebra** (Polynomials, functions, systems)
* **Geometry** (Shapes, coordinates, area/volume)
* **Trigonometry** (Identities, unit circle)
* **Calculus** (Limits, derivatives, integrals)
* **Linear Algebra** (Matrices, vectors, eigenvalues)
* **Probability & Statistics** (Distributions, inference)
* **Combinatorics** (Counting, permutations, pigeonhole)
* **Number Theory** (Primes, modular arithmetic)
* **Logic & Discrete Math** (Boolean algebra, graph theory, proofs)
* **Other**
<img src="plot_subject_distribution.png" alt="Subject Distribution" width="1200" />
<img src="plot_subject_distribution_pie.png" alt="Subject Distribution" width="800" />
#### Difficulty Distribution
Difficulty is estimated on a **1-10 scale**, explicitly mapped to the [AoPS ratings](https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings). This allows for precise curriculum learning (e.g., training on Level 1-3 before attempting Level 8).
| Level | Equivalent Competition Tier | Description |
| :--- | :--- | :--- |
| **1** | **Elementary / Middle School** | MOEMS, AMC 8 (Early Qs). Standard word problems. |
| **2** | **Junior High** | AMC 8 (Hard), AMC 10 (Early). Complex word problems. |
| **3** | **High School Beginner** | AMC 10 (Mid), AMC 12 (Early). Requires creative thinking. |
| **4** | **High School Intermediate** | AMC 12 (Mid), AIME (Early). Intermediate complexity. |
| **5** | **Advanced High School** | AIME (Mid), JBMO. Simple proof-based Olympiad style. |
| **6** | **Pre-Olympiad** | AIME (Hard), USAJMO. Introductory Olympiad level. |
| **7** | **Olympiad (Entry)** | IMO (Easy/Medium), USAMO. Requires technical knowledge. |
| **8** | **Olympiad (Medium)** | IMO (Medium/Hard). High-level competition problems. |
| **9** | **Olympiad (Expert)** | IMO (Hard). Expert-level constructions/proofs. |
| **10** | **Historically Hard** | Outliers. Exceedingly tedious or difficult even for Olympians. |
<img src="plot_difficulty_distribution_curve.png" alt="Difficulty Distribution" width="600" />
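The curriculum-learning use case mentioned above can be sketched by bucketing records into stages of increasing difficulty. This assumes `difficulty` is stored as an integer (or a string that parses as one); the stage boundaries `(3, 6, 10)` and the function name are illustrative choices, not part of the dataset.

```python
from bisect import bisect_left


def curriculum_stages(
    records: list[dict], upper_bounds: tuple[int, ...] = (3, 6, 10)
) -> list[list[dict]]:
    """Group records into stages; stage i holds levels up to upper_bounds[i]."""
    stages: list[list[dict]] = [[] for _ in upper_bounds]
    for record in records:
        level = int(record["difficulty"])
        if not 1 <= level <= upper_bounds[-1]:
            continue  # skip labels outside the documented 1-10 scale
        stages[bisect_left(upper_bounds, level)].append(record)
    return stages
```

With the default bounds, the stages cover Levels 1-3, 4-6, and 7-10, so a training loop can consume them in order.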
#### Format Distribution
We classify the *structure* of the question to aid in evaluation setup. The classification follows a strict **hierarchical decision tree**:
1. **Multiple Choice:** (Highest Priority) Contains explicit options (A/B/C/D) or "Select all that apply."
2. **Proof:** Explicitly asks to "prove," "show," "justify," or "verify."
3. **Fill-in-the-Blank:** Contains blank indicators (`__`, `[blank]`) or explicit formatting requests.
4. **Problem Solving:** (Default) Standard open-ended computation, word problems, or derivations.
<img src="plot_format_distribution.png" alt="Format Distribution" width="600" />
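The actual labels were produced by an LLM, but the priority order of the decision tree can be illustrated with a simple rule-based approximation. The regular expressions and the returned label strings below are assumptions for illustration and will misclassify many real queries.

```python
import re

# Option markers such as "(A) ... (B)" or "A. ... B. " (case-sensitive).
OPTION_LABELS = re.compile(r"\(A\).*\(B\)|\bA[.)]\s.*\bB[.)]\s", re.DOTALL)
SELECT_ALL = re.compile(r"select all that apply", re.IGNORECASE)
PROOF_VERBS = re.compile(r"\b(prove|show|justify|verify)\b", re.IGNORECASE)
BLANK_MARKERS = re.compile(r"_{2,}|\[blank\]", re.IGNORECASE)


def classify_format(question: str) -> str:
    """Apply the four rules in priority order and return the first match."""
    if OPTION_LABELS.search(question) or SELECT_ALL.search(question):
        return "Multiple Choice"
    if PROOF_VERBS.search(question):
        return "Proof"
    if BLANK_MARKERS.search(question):
        return "Fill-in-the-Blank"
    return "Problem Solving"
```

Because the checks run top to bottom, a question that both lists options and asks the reader to "show" something is labeled Multiple Choice, matching the stated priority.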
---
## Future Work
We are continuously:
* Expanding the dataset with more high-quality mathematical sources.
* Refining the difficulty and subject classification models for higher accuracy.
* Adding more fine-grained annotations for solution steps and reasoning types.
---
## About OpenDataArena
[OpenDataArena](https://opendataarena.github.io/) is an open research platform dedicated to **discovering, evaluating, and advancing high-quality datasets for AI post-training**. It provides a transparent, data-centric ecosystem to support reproducible dataset evaluation and sharing.
**Key Features:**
* 🏆 **Dataset Leaderboard**: helps researchers identify **the most valuable and high-quality datasets across different domains**.
* 📊 **Detailed Evaluation Scores**: provides **comprehensive metrics** to assess data quality, complexity, difficulty, and more.
* 🧰 **Data Processing Toolkit**: [OpenDataArena-Tool](https://github.com/OpenDataArena/OpenDataArena-Tool) offers an open-source pipeline for dataset curation and scoring.
If you find our work helpful, please consider **⭐ starring and subscribing** to support our research.
---
## Citation
```bibtex
@dataset{opendataarena_mathlake_2025,
author = {OpenDataArena},
title = {MathLake: A Large-Scale Mathematics Dataset},
year = {2025},
publisher = {Hugging Face}
}
```
Provider: maas
Created: 2025-11-29



