NuminaMath-1.5
Source: ModelScope community (updated 2026-05-11, indexed 2025-02-15)
Download link: https://modelscope.cn/datasets/AI-MO/NuminaMath-1.5
# Dataset Card for NuminaMath 1.5
## Dataset Description
- **Homepage:** https://projectnumina.ai
- **Repository:**
- **Paper:** https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf
- **Leaderboard:**
- **Point of Contact:** [Jia Li](mailto:jia@projectnumina.ai)
### Dataset Summary
This is the second iteration of the popular [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) dataset, bringing high-quality post-training data for approximately 900k competition-level math problems. Each solution is formatted in a Chain of Thought (CoT) manner. The sources of the dataset range from Chinese high school math exercises to US and international mathematics olympiad competition problems. The data were primarily collected from online exam paper PDFs and mathematics discussion forums.
### What's new?
#### Problem metadata
Because a verifiable output matters for each problem, we have added `answer`, `problem_type`, and `question_type` metadata to all problems:
- `answer`: Final answer of the problem when `question_type` is a "math word problem", i.e. a number-valued output. For problems which do not belong to this category, `answer` takes one of the following special values:
  - `proof`: When the `question_type` is proof
  - `notfound`: When we cannot find the answer from the `ref_solution`
- `problem_type`: The mathematical domain of the problem. See `find_problem_type` for more information. Here are the supported types:
  - Algebra
  - Geometry
  - Number Theory
  - Combinatorics
  - Calculus
  - Inequalities
  - Logic and Puzzles
  - Other
- `question_type`: The form or style of the mathematical problem.
  - multiple-choice question (MCQ)
  - proof
  - math-word-problem (problem with output)
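As a rough illustration of how these fields might be used, the sketch below keeps only records whose `answer` is a concrete final answer, i.e. neither of the special values `proof` or `notfound`. The helper name `is_verifiable`, the sample records, and the exact string spellings of `question_type` are assumptions made for this example; the real dataset rows may differ.

```python
# Sentinel values the card lists for `answer` when no final answer is available.
SPECIAL_ANSWERS = {"proof", "notfound"}


def is_verifiable(record: dict) -> bool:
    """Return True when the record carries a concrete final answer.

    Hypothetical helper: assumes `answer` and `question_type` are strings.
    """
    answer = str(record.get("answer") or "").strip()
    if not answer or answer in SPECIAL_ANSWERS:
        return False
    # Proof problems have no checkable output even if `answer` is filled in.
    return record.get("question_type") != "proof"


# Hypothetical records, invented for illustration only.
samples = [
    {"problem": "Compute 2 + 3.", "answer": "5",
     "question_type": "math-word-problem", "problem_type": "Algebra"},
    {"problem": "Prove that sqrt(2) is irrational.", "answer": "proof",
     "question_type": "proof", "problem_type": "Number Theory"},
    {"problem": "Find the area of the triangle.", "answer": "notfound",
     "question_type": "math-word-problem", "problem_type": "Geometry"},
]

verifiable = [r for r in samples if is_verifiable(r)]
print(len(verifiable))
```

The same predicate could be applied to rows loaded from the published dataset, provided the column names match those described above.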
#### Some new data (more to come)
- Olympiads Reference (source: `olympiads_ref`). After the publication of the first [NuminaMath](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) dataset, we realized that the `olympiads` subset had many parsing issues, caused by the use of generic regular expressions and LLMs. To fix this, we used the official websites of dozens of national Math Olympiads to manually parse and verify the problems and solutions.
- More manually curated data. `cn_contest`, `inequalities` and `number_theory` are manually curated competition problems provided by our data partners.
- Removal of the synthetic dataset `synthetic_amc`. In our ablation study, it slightly hurt performance. In the future, we plan to remove all synthetic data until we find a way to reliably generate high-quality synthetic problems.
### Source breakdown
| source | problems | question_type:proof | question_type:mcq | question_type:word |
|:---------------|-----------:|----------------------:|--------------------:|---------------------:|
| olympiads | 197084 | 62970 | 13529 | 117845 |
| olympiads_ref | 3638 | 2246 | 0 | 1392 |
| amc_aime | 5872 | 208 | 4374 | 963 |
| aops_forum | 67841 | 24532 | 5924 | 33486 |
| cn_contest | 29944 | 8663 | 5602 | 15649 |
| inequalities | 7314 | 5780 | 49 | 1478 |
| number_theory | 4043 | 2591 | 15 | 1239 |
| cn_k12 | 268819 | 3966 | 115800 | 149010 |
| orca_math | 151934 | 1 | 17 | 151916 |
| synthetic_math | 148712 | 41 | 1057 | 147612 |
| metamath | 11014 | 0 | 82 | 10932 |
| Total | 896215 | 110998 | 146449 | 631522 |
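
The totals row can be checked against the per-source rows. The sketch below copies the numbers from the table, reads the two empty cells as 0 (an assumption, though it is the only reading consistent with the totals), and also computes how many problems per source fall outside the three listed question types. The card does not say how those remaining problems are labeled.

```python
# (problems, proof, mcq, word) per source, copied from the table above.
# The two empty cells (olympiads_ref mcq, metamath proof) are read as 0.
ROWS = {
    "olympiads":      (197084, 62970, 13529, 117845),
    "olympiads_ref":  (3638, 2246, 0, 1392),
    "amc_aime":       (5872, 208, 4374, 963),
    "aops_forum":     (67841, 24532, 5924, 33486),
    "cn_contest":     (29944, 8663, 5602, 15649),
    "inequalities":   (7314, 5780, 49, 1478),
    "number_theory":  (4043, 2591, 15, 1239),
    "cn_k12":         (268819, 3966, 115800, 149010),
    "orca_math":      (151934, 1, 17, 151916),
    "synthetic_math": (148712, 41, 1057, 147612),
    "metamath":       (11014, 0, 82, 10932),
}
TOTAL = (896215, 110998, 146449, 631522)

# Sum each column across sources and compare with the Total row.
column_sums = tuple(sum(col) for col in zip(*ROWS.values()))
print(column_sums == TOTAL)

# Problems per source not counted under proof, mcq, or word.
unclassified = {
    name: n - (proof + mcq + word)
    for name, (n, proof, mcq, word) in ROWS.items()
}
print(sum(unclassified.values()))
```

All four column sums match the Total row. The three question-type columns together account for 888969 of the 896215 problems, so 7246 problems are not covered by any of the three columns.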
### Licensing Information
The dataset is available under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
### Citation Information
```
@misc{numina_math_datasets,
author = {Jia LI and Edward Beeching and Lewis Tunstall and Ben Lipkin and Roman Soletskyi and Shengyi Costa Huang and Kashif Rasul and Longhui Yu and Albert Jiang and Ziju Shen and Zihan Qin and Bin Dong and Li Zhou and Yann Fleureau and Guillaume Lample and Stanislas Polu},
title = {NuminaMath},
year = {2024},
publisher = {Numina},
journal = {Hugging Face repository},
howpublished = {\url{https://huggingface.co/datasets/AI-MO/NuminaMath-1.5}}
}
```
Provider: maas
Created: 2025-02-11



