
OpenO1-SFT-Ultra

ModelScope community. Updated 2025-12-04. Indexed 2025-06-14.
Download link:
https://modelscope.cn/datasets/AI-ModelScope/OpenO1-SFT-Ultra
Resource overview:
# openo1-sft-ultra-35m-data

## Instruction

We have released the openo1-sft-ultra-35m-data, which contains 35 million data points. It is based on existing open-source datasets and synthesized using the openo1-qwen-sft model. We first collected open-source datasets and then annotated the data based on difficulty, quality, and question types using the qwen-2.5-72b-instruct model. To ensure the difficulty and quality of the data, we only retained data where both the difficulty and quality are ≥8.

## Data format

- 'uid': Data ID
- 'query': Original data query
- 'response': Long COT response, including detailed thought process
- 'source': Data source
- 'difficulty': Question difficulty, range from 1 to 10
- 'quality': Data quality, range from 1 to 10
- 'answer': Ground truth answer of the data
- 'query_len': Length of the question
- 'response_len': Length of the answer
- 'Topic': Data topic category, including Math | Code | Reasoning
- 'answer_type': Answer type annotation
  - For math domain:
    - a: Purely numerical answer
    - b: Purely formulaic answer
    - c: Long textual explanation
  - For code domain:
    - a: Answer that includes code
    - b: Answer that contains only code-related text
  - For reasoning domain:
    - a: Answer consisting of only words/phrases
    - b: Answer that includes a long textual explanation
- "cases": Test cases included in the code

## Data statistics

### Data source

We have used the following sources for the data:

- WebInstructFull
- homework
- infinity-instruct
- math-stack-exchange
- MathInstruct
- mcq

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65d2251f98b4a470bf6a26e3/27aLuIiGoc4onBxkxq2qV.png)

<!-- Insert statistical charts for different sources here -->
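The retention rule and the `answer_type` codes described in this card can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes each record is a plain dictionary using the field names listed in the card, that `difficulty` and `quality` are stored as numbers, and that `Topic` takes the literal values `Math`, `Code`, or `Reasoning`. The helper names (`is_retained`, `describe_answer_type`, `filter_records`) and the sample records are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Answer-type codes per topic, transcribed from the "Data format" list in the card.
ANSWER_TYPES: Dict[str, Dict[str, str]] = {
    "Math": {
        "a": "purely numerical answer",
        "b": "purely formulaic answer",
        "c": "long textual explanation",
    },
    "Code": {
        "a": "answer that includes code",
        "b": "answer that contains only code-related text",
    },
    "Reasoning": {
        "a": "answer consisting of only words/phrases",
        "b": "answer that includes a long textual explanation",
    },
}

MIN_SCORE = 8  # the card keeps data where difficulty and quality are both >= 8


def is_retained(record: Dict[str, Any], min_score: int = MIN_SCORE) -> bool:
    """Return True when both scores meet the threshold (assumes numeric fields)."""
    return record["difficulty"] >= min_score and record["quality"] >= min_score


def describe_answer_type(record: Dict[str, Any]) -> str:
    """Map the (Topic, answer_type) pair to its description, or 'unknown'."""
    return ANSWER_TYPES.get(record["Topic"], {}).get(record["answer_type"], "unknown")


def filter_records(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the records that pass the retention rule."""
    return [r for r in records if is_retained(r)]


# Hypothetical records for illustration only; they are not taken from the dataset.
sample = [
    {"uid": "ex-1", "Topic": "Math", "difficulty": 9, "quality": 8, "answer_type": "a"},
    {"uid": "ex-2", "Topic": "Code", "difficulty": 7, "quality": 9, "answer_type": "a"},
]

kept = filter_records(sample)
for record in kept:
    print(record["uid"], "->", describe_answer_type(record))
```

Since the released data is described as already filtered, `is_retained` is mainly useful as a sanity check, or for applying a stricter threshold through `min_score`.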

Provider:
maas
Created:
2024-12-28