OpenThoughts-114k-math
收藏魔搭社区2026-01-06 更新2025-02-08 收录
下载链接:
https://modelscope.cn/datasets/open-r1/OpenThoughts-114k-math
下载链接
链接失效反馈官方服务:
资源简介:
This is a filtered and metadata enriched version of [`open-thoughts/OpenThoughts-114k`](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k).
While the original dataset is a valuable resource containing [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) outputs, it has very little metadata (only 2 fields: `system` and `conversations`). It does not contain, for instance, the original solution label, which means that we can not verify the model answers.
## What we did
- filtered the dataset for math content (math questions were prefixed by "Return your final response within \\boxed{}." -- see [here](https://github.com/open-thoughts/open-thoughts/blob/main/open_thoughts/math/reason.py#L16C43-L16C90))
- found the original questions in the [`AI-MO/NuminaMath-CoT`](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) and mapped them back to each generation
- verified model generations using our [Math-Verify library](https://github.com/huggingface/Math-Verify)
- added a metadata field with the token count of each DeepSeek-R1 completion
## Data structure
- `source`: original `source` from Numina-Math
- `problem`: problem statement, from Numina-Math
- `solution`: original solution/gold label, from Numina-Math
- `messages`: message turns for finetuning on the correct solutions, from Numina-Math
- `system`: system prompt sent to DeepSeek-R1, from OpenThoughts
- `conversations`: message turns from the DeepSeek-R1 generation. The last turn is the model output, from OpenThoughts
- `generated_token_count`: number of tokens (counted using the DeepSeek-R1 tokenizer) of the model output.
- `correct`: label indicating if the DeepSeek-R1 generated solution matches the ground truth `solution`. Checked with [Math-Verify library](https://github.com/huggingface/Math-Verify)
## Some statistics
- The original OpenThoughts-114k dataset has **89120/113957 (78%)** math rows
- Of those, **56730/89120 (63%)** have correct answers, as checked by Math-Verify
- There is a single generation per question
- Token count distribution: mean=6366.67, std_dev=4662.88 tokens

## 数据集概述
本数据集是对 [`open-thoughts/OpenThoughts-114k`](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) 进行过滤并补充元数据后的版本。
原始数据集作为包含[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)输出的宝贵研究资源,但其元数据极为匮乏,仅包含`system`和`conversations`两个字段。例如,其缺失原始解答标签,导致无法验证模型生成答案的正确性。
## 数据处理流程
- 对数据集进行数学内容过滤:原始数学问题均以"Return your final response within \boxed{}."作为前缀(详见[此处](https://github.com/open-thoughts/open-thoughts/blob/main/open_thoughts/math/reason.py#L16C43-L16C90))
- 从[`AI-MO/NuminaMath-CoT`](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)中检索原始问题,并将其与每条模型生成结果进行一一映射
- 使用我们的[Math-Verify库](https://github.com/huggingface/Math-Verify)对模型生成内容进行正确性验证
- 新增元数据字段,用于记录每条DeepSeek-R1生成结果的Token数量
## 数据结构
各字段说明如下:
- `source`:源自Numina-Math的原始`source`字段
- `problem`:源自Numina-Math的问题描述文本
- `solution`:源自Numina-Math的原始解答/标准答案
- `messages`:源自Numina-Math的、用于基于正确解答进行微调的对话轮次
- `system`:源自OpenThoughts的、发送至DeepSeek-R1的系统提示词
- `conversations`:源自OpenThoughts的、DeepSeek-R1生成的对话轮次,最后一轮为模型输出内容
- `generated_token_count`:模型输出内容的Token总数(采用DeepSeek-R1的Tokenizer进行统计)
- `correct`:用于标识DeepSeek-R1生成的解答是否与标准答案`solution`一致的标签,通过[Math-Verify库](https://github.com/huggingface/Math-Verify)完成校验
## 统计信息
- 原始OpenThoughts-114k数据集中共有**89120/113957(78%)**条数学相关数据
- 其中经Math-Verify校验后,**56730/89120(63%)**的模型生成结果为正确解答
- 每个问题仅对应一条模型生成结果
- Token数量分布:均值=6366.67,标准差=4662.88 Tokens

提供机构:
maas
创建时间:
2025-02-11



