lufish01/OpenR1-Math-220k
Source: Hugging Face. Updated 2026-03-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/lufish01/OpenR1-Math-220k
Resource description:
---
license: apache-2.0
language:
- en
configs:
- config_name: all
  data_files:
  - split: train
    path: all/train-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: extended
  data_files:
  - split: train
    path: extended/train-*
dataset_info:
- config_name: all
  features:
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: answer
    dtype: string
  - name: problem_type
    dtype: string
  - name: question_type
    dtype: string
  - name: source
    dtype: string
  - name: uuid
    dtype: string
  - name: is_reasoning_complete
    sequence: bool
  - name: generations
    sequence: string
  - name: correctness_math_verify
    sequence: bool
  - name: correctness_llama
    sequence: bool
  - name: finish_reasons
    sequence: string
  - name: correctness_count
    dtype: int64
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 9734110026.0
    num_examples: 225129
  download_size: 4221672067
  dataset_size: 9734110026.0
- config_name: default
  features:
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: answer
    dtype: string
  - name: problem_type
    dtype: string
  - name: question_type
    dtype: string
  - name: source
    dtype: string
  - name: uuid
    dtype: string
  - name: is_reasoning_complete
    sequence: bool
  - name: generations
    sequence: string
  - name: correctness_math_verify
    sequence: bool
  - name: correctness_llama
    sequence: bool
  - name: finish_reasons
    sequence: string
  - name: correctness_count
    dtype: int64
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 4964543659
    num_examples: 93733
  download_size: 2149897914
  dataset_size: 4964543659
- config_name: extended
  features:
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: answer
    dtype: string
  - name: problem_type
    dtype: string
  - name: question_type
    dtype: string
  - name: source
    dtype: string
  - name: uuid
    dtype: string
  - name: is_reasoning_complete
    sequence: bool
  - name: generations
    sequence: string
  - name: correctness_math_verify
    sequence: bool
  - name: correctness_llama
    sequence: bool
  - name: finish_reasons
    sequence: string
  - name: correctness_count
    dtype: int64
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 4769566550
    num_examples: 131396
  download_size: 2063936457
  dataset_size: 4769566550
---
# OpenR1-Math-220k
## Dataset description
OpenR1-Math-220k is a large-scale dataset for mathematical reasoning. It consists of 220k math problems from NuminaMath 1.5, each with two to four reasoning traces generated by [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).
The traces were verified using [Math Verify](https://github.com/huggingface/Math-Verify) for most samples and [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) as a judge for 12% of the samples. Each problem contains at least one reasoning trace with a correct answer.
The dataset is organised into two subsets (configs), each with a single `train` split:
- `default`, with 94k problems, which achieves the best performance after SFT.
- `extended`, with 131k samples, where we add data sources like `cn_k12`. This provides more reasoning traces, but we found the performance after SFT to be lower than with the `default` subset, likely because the questions from `cn_k12` are less difficult than those from other sources.

A third config, `all`, combines both subsets (225,129 examples).
You can load the dataset as follows:
```python
from datasets import load_dataset
ds = load_dataset("open-r1/OpenR1-Math-220k", "default")
```
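Once loaded, each row carries parallel per-trace lists (`generations`, `correctness_math_verify`, `correctness_llama`, and so on), as declared in the schema above. The sketch below picks the traces flagged as correct from a single row. It assumes the lists are index-aligned and that a flag list may be missing or `None` when a verifier was not run; `correct_generations` is a hypothetical helper name, not part of any dataset tooling.

```python
from typing import Any, Dict, List, Optional


def correct_generations(row: Dict[str, Any]) -> List[str]:
    """Return the generations flagged correct by either verifier.

    Assumption: `generations`, `correctness_math_verify` and
    `correctness_llama` are index-aligned; a missing, None, or
    shorter flag list counts as "not verified" for that trace.
    """
    generations: List[str] = row.get("generations") or []
    math_flags: List[Optional[bool]] = row.get("correctness_math_verify") or []
    llama_flags: List[Optional[bool]] = row.get("correctness_llama") or []

    def flag(flags: List[Optional[bool]], i: int) -> bool:
        # Treat out-of-range or None entries as False.
        return i < len(flags) and bool(flags[i])

    return [
        gen
        for i, gen in enumerate(generations)
        if flag(math_flags, i) or flag(llama_flags, i)
    ]


# Toy row using the dataset's field names (the values are made up).
example_row = {
    "generations": ["trace A", "trace B"],
    "correctness_math_verify": [True, False],
    "correctness_llama": None,
}
print(correct_generations(example_row))  # ['trace A']
```

The same function can be passed to `ds.map` or used inside a filter once its assumptions have been checked against real rows.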
## Dataset curation
To build OpenR1-Math-220k, we prompt the [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) model to generate solutions for 400k problems from [NuminaMath 1.5](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5) using [SGLang](https://github.com/sgl-project/sglang). The generation code is available [here](https://github.com/huggingface/open-r1/tree/main/slurm). We follow the model card’s recommended generation parameters and prepend the following instruction to the user prompt:
`"Please reason step by step, and put your final answer within \boxed{}."`
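As a rough illustration of how this instruction might be used, the sketch below prepends it to a problem and pulls the final `\boxed{...}` answer out of a generated trace. The blank-line separator in `build_prompt` is an assumption (the exact formatting is not stated here), and `last_boxed` is a hypothetical, simplified stand-in for illustration only; it is not the Math Verify parser.

```python
from typing import Optional

INSTRUCTION = r"Please reason step by step, and put your final answer within \boxed{}."


def build_prompt(problem: str) -> str:
    # Assumption: instruction and problem are separated by a blank line.
    return f"{INSTRUCTION}\n\n{problem}"


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Braces are matched by depth so that nested LaTeX groups such as
    \\frac{1}{2} are kept intact.
    """
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces, e.g. a truncated generation


print(build_prompt("What is 1 + 1?"))
print(last_boxed(r"So the answer is \boxed{\frac{1}{2}}."))
```

Returning `None` for unbalanced braces is a deliberate choice here: a trace that hits the token limit mid-answer should not yield a partial answer.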
We set a 16k token limit per generation, as our analysis showed that only 75% of problems could be solved in under 8k tokens, and most of the remaining problems required the full 16k tokens. We were able to generate 25 solutions per hour per H100, enabling us to generate 300k problem solutions per day on 512 H100s.
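The per-day figure follows from the per-GPU rate. A quick check using only the numbers quoted above, assuming round-the-clock utilisation (which the text does not state explicitly):

```python
SOLUTIONS_PER_GPU_HOUR = 25  # quoted above
NUM_GPUS = 512               # quoted above
HOURS_PER_DAY = 24           # assumption: GPUs are busy all day

solutions_per_day = SOLUTIONS_PER_GPU_HOUR * NUM_GPUS * HOURS_PER_DAY
print(solutions_per_day)  # 307200, i.e. roughly 300k per day
```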
We generate two solutions per problem (and in some cases, four) to provide flexibility in filtering and training. This approach allows for rejection sampling, similar to DeepSeek R1’s methodology, and also makes the dataset suitable for preference optimisation methods like DPO.
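One way multiple traces per problem could feed preference optimisation is to pair each correct trace with each incorrect one for the same problem. The sketch below is only an illustration under stated assumptions: it uses the `correctness_math_verify` flags alone, assumes they are index-aligned with `generations`, and `preference_pairs` is a hypothetical helper rather than anything shipped with the dataset.

```python
from itertools import product
from typing import Any, Dict, List, Tuple


def preference_pairs(row: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs from one row.

    Assumption: `generations` and `correctness_math_verify` are
    index-aligned; rows with only correct or only incorrect traces
    produce no pairs.
    """
    generations = row.get("generations") or []
    flags = row.get("correctness_math_verify") or []
    correct = [g for g, ok in zip(generations, flags) if ok]
    incorrect = [g for g, ok in zip(generations, flags) if not ok]
    # Every correct trace is preferred over every incorrect one.
    return list(product(correct, incorrect))


# Toy row using the dataset's field names (the values are made up).
toy_row = {
    "generations": ["A", "B", "C", "D"],
    "correctness_math_verify": [True, False, True, False],
}
print(preference_pairs(toy_row))
# [('A', 'B'), ('A', 'D'), ('C', 'B'), ('C', 'D')]
```

For rejection sampling, keeping only the `correct` list (or a single sample from it) would play the corresponding role.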
## License
The dataset is licensed under Apache 2.0.
Provider:
lufish01



