OpenThoughts-114k-math-open-r1

Name: OpenThoughts-114k-math-open-r1
Creator: maas
Published: 2025-11-12 16:22:56
License: 暂无描述

魔搭社区2025-11-12 更新2025-02-15 收录

下载链接：

https://modelscope.cn/datasets/AI-ModelScope/OpenThoughts-114k-math-open-r1

下载链接

链接失效反馈

官方服务：

资源简介：

This is a filtered and metadata enriched version of [`open-thoughts/OpenThoughts-114k`](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k). While the original dataset is a valuable resource containing [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) outputs, it has very little metadata (only 2 fields: `system` and `conversations`). It does not contain, for instance, the original solution label, which means that we can not verify the model answers. ## What we did - filtered the dataset for math content (math questions were prefixed by "Return your final response within \\boxed{}." -- see [here](https://github.com/open-thoughts/open-thoughts/blob/main/open_thoughts/math/reason.py#L16C43-L16C90)) - found the original questions in the [`AI-MO/NuminaMath-CoT`](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) and mapped them back to each generation - verified model generations using our [Math-Verify library](https://github.com/huggingface/Math-Verify) - added a metadata field with the token count of each DeepSeek-R1 completion ## Data structure - `source`: original `source` from Numina-Math - `problem`: problem statement, from Numina-Math - `solution`: original solution/gold label, from Numina-Math - `messages`: message turns for finetuning on the correct solutions, from Numina-Math - `system`: system prompt sent to DeepSeek-R1, from OpenThoughts - `conversations`: message turns from the DeepSeek-R1 generation. The last turn is the model output, from OpenThoughts - `generated_token_count`: number of tokens (counted using the DeepSeek-R1 tokenizer) of the model output. - `correct`: label indicating if the DeepSeek-R1 generated solution matches the ground truth `solution`. Checked with [Math-Verify library](https://github.com/huggingface/Math-Verify) ## Some statistics - The original OpenThoughts-114k dataset has **89120/113957 (78%)** math rows - Of those, **56730/89120 (63%)** have correct answers, as checked by Math-Verify - There is a single generation per question - Token count distribution: mean=6366.67, std_dev=4662.88 tokens ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62596f9e1c0a084224b93e00/aPYBSni3Ft6VK1VJkExtS.png)

本数据集是对 [`open-thoughts/OpenThoughts-114k`](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) 进行过滤与元数据增强后的版本。原始数据集作为包含[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)输出的优质资源，仅自带极少量元数据（仅含`system`与`conversations`两个字段），例如缺少原始解答标签，导致无法验证模型生成答案的正确性。 ### 我们所开展的工作 - 针对数学类内容完成数据集过滤：数学题目均以“Return your final response within oxed{}.”作为前缀（详见[此处](https://github.com/open-thoughts/open-thoughts/blob/main/open_thoughts/math/reason.py#L16C43-L16C90)） - 从 [`AI-MO/NuminaMath-CoT`](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) 数据集内获取原始问题，并将其与每条模型生成结果进行映射关联 - 借助我们开发的[Math-Verify工具库](https://github.com/huggingface/Math-Verify)对模型生成内容进行正确性校验 - 新增元数据字段，用于统计并记录每条DeepSeek-R1生成结果的Token数量 ### 数据结构 - `source`：源自Numina-Math的原始来源字段 - `problem`：源自Numina-Math的问题题干 - `solution`：源自Numina-Math的原始标准解答（金标签） - `messages`：用于针对正确解答进行微调的对话轮次数据，源自Numina-Math - `system`：发送给DeepSeek-R1的系统提示词，源自OpenThoughts数据集 - `conversations`：DeepSeek-R1生成的对话轮次数据，最后一轮为模型输出结果，源自OpenThoughts数据集 - `generated_token_count`：模型输出结果的Token数量（采用DeepSeek-R1对应的Tokenizer进行统计） - `correct`：用于标识DeepSeek-R1生成的解答是否与基准`solution`一致的标签，通过[Math-Verify工具库](https://github.com/huggingface/Math-Verify)完成校验 ### 统计信息 - 原始OpenThoughts-114k数据集中共有**89120/113957（78%）**条数学相关条目 - 其中**56730/89120（63%）**的条目经Math-Verify校验后拥有正确答案 - 每个问题仅对应一条模型生成结果 - Token数量分布：均值为6366.67，标准差为4662.88个Token ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62596f9e1c0a084224b93e00/aPYBSni3Ft6VK1VJkExtS.png)

提供机构：

maas

创建时间：

2025-02-11

5,000+

优质数据集

54 个

任务类型

进入经典数据集