
lufish01/OpenR1-Math-220k

Hugging Face · Updated 2026-03-14 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/lufish01/OpenR1-Math-220k
Resource description:
---
license: apache-2.0
language:
- en
configs:
- config_name: all
  data_files:
  - split: train
    path: all/train-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: extended
  data_files:
  - split: train
    path: extended/train-*
dataset_info:
- config_name: all
  features:
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: answer
    dtype: string
  - name: problem_type
    dtype: string
  - name: question_type
    dtype: string
  - name: source
    dtype: string
  - name: uuid
    dtype: string
  - name: is_reasoning_complete
    sequence: bool
  - name: generations
    sequence: string
  - name: correctness_math_verify
    sequence: bool
  - name: correctness_llama
    sequence: bool
  - name: finish_reasons
    sequence: string
  - name: correctness_count
    dtype: int64
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 9734110026.0
    num_examples: 225129
  download_size: 4221672067
  dataset_size: 9734110026.0
- config_name: default
  features:
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: answer
    dtype: string
  - name: problem_type
    dtype: string
  - name: question_type
    dtype: string
  - name: source
    dtype: string
  - name: uuid
    dtype: string
  - name: is_reasoning_complete
    sequence: bool
  - name: generations
    sequence: string
  - name: correctness_math_verify
    sequence: bool
  - name: correctness_llama
    sequence: bool
  - name: finish_reasons
    sequence: string
  - name: correctness_count
    dtype: int64
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 4964543659
    num_examples: 93733
  download_size: 2149897914
  dataset_size: 4964543659
- config_name: extended
  features:
  - name: problem
    dtype: string
  - name: solution
    dtype: string
  - name: answer
    dtype: string
  - name: problem_type
    dtype: string
  - name: question_type
    dtype: string
  - name: source
    dtype: string
  - name: uuid
    dtype: string
  - name: is_reasoning_complete
    sequence: bool
  - name: generations
    sequence: string
  - name: correctness_math_verify
    sequence: bool
  - name: correctness_llama
    sequence: bool
  - name: finish_reasons
    sequence: string
  - name: correctness_count
    dtype: int64
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train
    num_bytes: 4769566550
    num_examples: 131396
  download_size: 2063936457
  dataset_size: 4769566550
---

# OpenR1-Math-220k

## Dataset description

OpenR1-Math-220k is a large-scale dataset for mathematical reasoning. It consists of 220k math problems from NuminaMath 1.5, each with two to four reasoning traces generated by [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1). The traces were verified using [Math Verify](https://github.com/huggingface/Math-Verify) for most samples and [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) as a judge for 12% of the samples. Each problem contains at least one reasoning trace with a correct answer.

The dataset consists of two subsets, each exposed as a config with a single `train` split:

- `default`, with 94k problems, which achieves the best performance after SFT.
- `extended`, with 131k samples, where we add data sources like `cn_k12`. This provides more reasoning traces, but we found the performance after SFT to be lower than with the `default` subset, likely because the questions from `cn_k12` are less difficult than those from other sources.

The metadata also lists an `all` config whose 225,129 examples equal the combined size of the two subsets (93,733 + 131,396).

You can load the dataset as follows:

```python
from datasets import load_dataset

ds = load_dataset("open-r1/OpenR1-Math-220k", "default")
```

## Dataset curation

To build OpenR1-Math-220k, we prompt the [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) model to generate solutions for 400k problems from [NuminaMath 1.5](https://huggingface.co/datasets/AI-MO/NuminaMath-1.5) using [SGLang](https://github.com/sgl-project/sglang). The generation code is available [here](https://github.com/huggingface/open-r1/tree/main/slurm).
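Each record stores its generated traces in `generations`, alongside the per-trace verification flags `correctness_math_verify` and `correctness_llama` listed in the metadata. The following is a minimal sketch of keeping only the traces that at least one verifier marked correct. The helper names and the sample record are hypothetical, and the sketch assumes a flag list may be missing or `None` when that verifier was not run for a record.

```python
from typing import Any, Dict, List, Optional


def _flag(flags: Optional[List[Optional[bool]]], index: int) -> bool:
    """Treat a missing list, a short list, or a None entry as 'not verified'."""
    if not flags or index >= len(flags):
        return False
    return bool(flags[index])


def verified_traces(record: Dict[str, Any]) -> List[str]:
    """Return the generations flagged correct by Math Verify or the Llama judge."""
    math_flags = record.get("correctness_math_verify")
    llama_flags = record.get("correctness_llama")
    return [
        trace
        for i, trace in enumerate(record["generations"])
        if _flag(math_flags, i) or _flag(llama_flags, i)
    ]


# Hypothetical record that mirrors the field names in the metadata.
example = {
    "generations": ["trace A", "trace B", "trace C"],
    "correctness_math_verify": [True, False, False],
    "correctness_llama": None,  # assumption: unset when the judge was not used
}

print(verified_traces(example))  # ['trace A']
```

The same function could be passed to a per-record `map` or `filter` step once the real field contents have been inspected. Treating unknown flags as "not verified" is a conservative choice rather than documented behaviour.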
We follow the model card's recommended generation parameters and prepend the following instruction to the user prompt:

`"Please reason step by step, and put your final answer within \boxed{}."`

We set a 16k token limit per generation, as our analysis showed that only 75% of problems could be solved in under 8k tokens, and most of the remaining problems required the full 16k tokens. We were able to generate 25 solutions per hour per H100, enabling us to generate roughly 300k problem solutions per day on 512 H100s.

We generate two solutions per problem, and in some cases four, to provide flexibility in filtering and training. This approach allows for rejection sampling, similar to DeepSeek R1's methodology, and also makes the dataset suitable for preference optimisation methods like DPO.

## License

The dataset is licensed under Apache 2.0.
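As a quick check on the throughput figures quoted in the curation section, the daily total follows directly from the per-GPU rate. This is a small sketch; the constant and function names are illustrative and do not come from the generation code.

```python
# Figures quoted in the dataset card.
SOLUTIONS_PER_GPU_HOUR = 25
NUM_GPUS = 512
HOURS_PER_DAY = 24


def solutions_per_day(per_gpu_hour: int, gpus: int, hours: int = HOURS_PER_DAY) -> int:
    """Total solutions generated per day across all GPUs at a constant rate."""
    return per_gpu_hour * gpus * hours


daily = solutions_per_day(SOLUTIONS_PER_GPU_HOUR, NUM_GPUS)
print(daily)  # 307200, consistent with the quoted ~300k per day
```

The calculation assumes every GPU sustains the quoted rate for a full 24 hours, which is why the card's rounded figure is slightly lower.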

Provider:
lufish01