OpenO1-SFT-Ultra

Name: OpenO1-SFT-Ultra
Creator: maas
Published: 2025-12-04 16:19:38
License: no description provided

ModelScope community · Updated 2025-12-04 · Indexed 2025-06-14

Download link:

https://modelscope.cn/datasets/AI-ModelScope/OpenO1-SFT-Ultra

Description:

# openo1-sft-ultra-35m-data

## Instruction

We have released the openo1-sft-ultra-35m-data, which contains 35 million data points. It is based on existing open-source datasets and synthesized using the openo1-qwen-sft model. We first collected open-source datasets and then annotated the data based on difficulty, quality, and question types using the qwen-2.5-72b-instruct model. To ensure the difficulty and quality of the data, we only retained data where both the difficulty and quality are ≥8.

## Data format

- `uid`: Data ID
- `query`: Original data query
- `response`: Long COT response, including detailed thought process
- `source`: Data source
- `difficulty`: Question difficulty, range from 1 to 10
- `quality`: Data quality, range from 1 to 10
- `answer`: Ground truth answer of the data
- `query_len`: Length of the question
- `response_len`: Length of the answer
- `Topic`: Data topic category, including Math | Code | Reasoning
- `answer_type`: Answer type annotation
  - For math domain:
    - a: Purely numerical answer
    - b: Purely formulaic answer
    - c: Long textual explanation
  - For code domain:
    - a: Answer that includes code
    - b: Answer that contains only code-related text
  - For reasoning domain:
    - a: Answer consisting of only words/phrases
    - b: Answer that includes a long textual explanation
- `cases`: Test cases included in the code

## Data statistics

### Data source

We have used the following sources for the data:

- WebInstructFull
- homework
- infinity-instruct
- math-stack-exchange
- MathInstruct
- mcq

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65d2251f98b4a470bf6a26e3/27aLuIiGoc4onBxkxq2qV.png)
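
The retention rule and the `answer_type` codes described above can be sketched in plain Python. This is only an illustration under stated assumptions: the field names come from the data format list, but the helper names (`keep_record`, `describe_answer_type`), the sample rows, and the assumption that `difficulty` and `quality` are stored as numbers are hypothetical and not taken from the dataset's own tooling.

```python
from typing import Any, Dict, Iterable, List

# Threshold stated in the description: both scores must be >= 8.
MIN_SCORE = 8

# answer_type codes per Topic, as listed in the data format section.
ANSWER_TYPE_LABELS = {
    ("Math", "a"): "Purely numerical answer",
    ("Math", "b"): "Purely formulaic answer",
    ("Math", "c"): "Long textual explanation",
    ("Code", "a"): "Answer that includes code",
    ("Code", "b"): "Answer that contains only code-related text",
    ("Reasoning", "a"): "Answer consisting of only words/phrases",
    ("Reasoning", "b"): "Answer that includes a long textual explanation",
}


def keep_record(record: Dict[str, Any], min_score: int = MIN_SCORE) -> bool:
    """Return True when both difficulty and quality meet the threshold.

    Assumes both fields are numeric (hypothetical; check the real schema).
    """
    return record["difficulty"] >= min_score and record["quality"] >= min_score


def filter_records(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the records that pass keep_record."""
    return [r for r in records if keep_record(r)]


def describe_answer_type(record: Dict[str, Any]) -> str:
    """Map a record's (Topic, answer_type) pair to its description."""
    key = (record["Topic"], record["answer_type"])
    return ANSWER_TYPE_LABELS.get(key, "unknown")


# Made-up rows for illustration only; they are not real dataset entries.
sample_records = [
    {"uid": "example-1", "Topic": "Math", "answer_type": "a",
     "difficulty": 9, "quality": 8},
    {"uid": "example-2", "Topic": "Code", "answer_type": "b",
     "difficulty": 7, "quality": 9},
]

kept = filter_records(sample_records)
kept_uids = [r["uid"] for r in kept]
first_label = describe_answer_type(kept[0])
```

Since the published data is described as already filtered to scores of 8 or higher, a check like this would mainly serve as a sanity test after download, or as a starting point for a stricter threshold.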

Provider:

maas

Created:

2024-12-28
