five

OpenThoughts-114k-math

收藏
魔搭社区2026-01-06 更新2025-02-08 收录
下载链接:
https://modelscope.cn/datasets/open-r1/OpenThoughts-114k-math
下载链接
链接失效反馈
官方服务:
资源简介:
This is a filtered and metadata enriched version of [`open-thoughts/OpenThoughts-114k`](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k). While the original dataset is a valuable resource containing [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) outputs, it has very little metadata (only 2 fields: `system` and `conversations`). It does not contain, for instance, the original solution label, which means that we can not verify the model answers. ## What we did - filtered the dataset for math content (math questions were prefixed by "Return your final response within \\boxed{}." -- see [here](https://github.com/open-thoughts/open-thoughts/blob/main/open_thoughts/math/reason.py#L16C43-L16C90)) - found the original questions in the [`AI-MO/NuminaMath-CoT`](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT) and mapped them back to each generation - verified model generations using our [Math-Verify library](https://github.com/huggingface/Math-Verify) - added a metadata field with the token count of each DeepSeek-R1 completion ## Data structure - `source`: original `source` from Numina-Math - `problem`: problem statement, from Numina-Math - `solution`: original solution/gold label, from Numina-Math - `messages`: message turns for finetuning on the correct solutions, from Numina-Math - `system`: system prompt sent to DeepSeek-R1, from OpenThoughts - `conversations`: message turns from the DeepSeek-R1 generation. The last turn is the model output, from OpenThoughts - `generated_token_count`: number of tokens (counted using the DeepSeek-R1 tokenizer) of the model output. - `correct`: label indicating if the DeepSeek-R1 generated solution matches the ground truth `solution`. Checked with [Math-Verify library](https://github.com/huggingface/Math-Verify) ## Some statistics - The original OpenThoughts-114k dataset has **89120/113957 (78%)** math rows - Of those, **56730/89120 (63%)** have correct answers, as checked by Math-Verify - There is a single generation per question - Token count distribution: mean=6366.67, std_dev=4662.88 tokens ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62596f9e1c0a084224b93e00/aPYBSni3Ft6VK1VJkExtS.png)

## 数据集概述 本数据集是对 [`open-thoughts/OpenThoughts-114k`](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) 进行过滤并补充元数据后的版本。 原始数据集作为包含[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)输出的宝贵研究资源,但其元数据极为匮乏,仅包含`system`和`conversations`两个字段。例如,其缺失原始解答标签,导致无法验证模型生成答案的正确性。 ## 数据处理流程 - 对数据集进行数学内容过滤:原始数学问题均以"Return your final response within \boxed{}."作为前缀(详见[此处](https://github.com/open-thoughts/open-thoughts/blob/main/open_thoughts/math/reason.py#L16C43-L16C90)) - 从[`AI-MO/NuminaMath-CoT`](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)中检索原始问题,并将其与每条模型生成结果进行一一映射 - 使用我们的[Math-Verify库](https://github.com/huggingface/Math-Verify)对模型生成内容进行正确性验证 - 新增元数据字段,用于记录每条DeepSeek-R1生成结果的Token数量 ## 数据结构 各字段说明如下: - `source`:源自Numina-Math的原始`source`字段 - `problem`:源自Numina-Math的问题描述文本 - `solution`:源自Numina-Math的原始解答/标准答案 - `messages`:源自Numina-Math的、用于基于正确解答进行微调的对话轮次 - `system`:源自OpenThoughts的、发送至DeepSeek-R1的系统提示词 - `conversations`:源自OpenThoughts的、DeepSeek-R1生成的对话轮次,最后一轮为模型输出内容 - `generated_token_count`:模型输出内容的Token总数(采用DeepSeek-R1的Tokenizer进行统计) - `correct`:用于标识DeepSeek-R1生成的解答是否与标准答案`solution`一致的标签,通过[Math-Verify库](https://github.com/huggingface/Math-Verify)完成校验 ## 统计信息 - 原始OpenThoughts-114k数据集中共有**89120/113957(78%)**条数学相关数据 - 其中经Math-Verify校验后,**56730/89120(63%)**的模型生成结果为正确解答 - 每个问题仅对应一条模型生成结果 - Token数量分布:均值=6366.67,标准差=4662.88 Tokens ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62596f9e1c0a084224b93e00/aPYBSni3Ft6VK1VJkExtS.png)
提供机构:
maas
创建时间:
2025-02-11
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作