Demeter-LongCoT-400K

Name: Demeter-LongCoT-400K
Creator: maas
Published: 2025-12-03 17:17:25
License: 暂无描述

魔搭社区2025-12-03 更新2025-12-06 收录

下载链接：

https://modelscope.cn/datasets/prithivMLmods/Demeter-LongCoT-400K

下载链接

链接失效反馈

官方服务：

资源简介：

![vNbx9I-PriQyvwMgaqd-Z.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/rl_NkH5DCrBOLlyjJzwW_.png) # **Demeter-LongCoT-400K** > **Demeter-LongCoT-400K** is a high-quality, compact chain-of-thought reasoning dataset curated for tasks in mathematics, science, and coding. While the dataset spans diverse domains, it is primarily driven by mathematical reasoning, reflecting a major share of math-focused prompts and long-form logical solutions. ## Quick Start with Hugging Face Datasets🤗 ```py pip install -U datasets ``` ```py from datasets import load_dataset dataset = load_dataset("prithivMLmods/Demeter-LongCoT-400K", split="train") ``` ## Overview * **Total Samples**: \~400,000 * **Split**: `train` only * **Languages**: English * **Format**: Apache Arrow (auto-converted to Parquet) * **License**: Apache-2.0 * **Tags**: `math`, `code`, `science`, `reasoning`, `longcot` ## Highlights * Structured to promote **long-form, step-by-step reasoning**, ideal for training and evaluating chain-of-thought (CoT) capable models. * Reasoning traces include natural, human-like explanations for both simple and complex problems. * Fine-tuned across math word problems, logic-based questions, and technical prompts from STEM domains. ## Dataset Structure Each entry in the dataset includes: * **`problem`** (string): A math, science, or code problem. * **`solution`** (string): A detailed step-by-step solution crafted in a long-form reasoning style. The reasoning structure in solutions helps models understand logical flow, intermediate steps, and layered deductions—making this dataset suitable for advanced LLMs requiring interpretable outputs. ## Source & Derivation Demeter-LongCoT-400K is a **random seed subset** derived from: * [Demeter-LongCoT-6M](https://huggingface.co/datasets/prithivMLmods/Demeter-LongCoT-6M) (\~6.4M samples). * Curated to \~400K entries while maintaining diverse coverage across domains. * Generated from a custom internal modular dataset tailored for logical and numeric reasoning tasks, with chain-of-thought style responses produced by QwQ 32B-based models. This dataset was created with a focus on enhancing CoT capabilities in large-scale models working on math, science, and code. ## License Apache License 2.0

![vNbx9I-PriQyvwMgaqd-Z.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/rl_NkH5DCrBOLlyjJzwW_.png) # **Demeter-LongCoT-400K** > **Demeter-LongCoT-400K** 是一款高质量、轻量化的思维链（chain-of-thought, CoT）推理数据集，专为数学、科学与编码类任务打造。尽管该数据集覆盖多元领域，但核心以数学推理为主，占比最高的提示词与长格式逻辑解答均围绕数学场景展开。 ## Hugging Face 数据集快速入门🤗 py pip install -U datasets py from datasets import load_dataset dataset = load_dataset("prithivMLmods/Demeter-LongCoT-400K", split="train") ## 数据集概览 * **总样本量**: 约400,000 * **数据集拆分**: 仅包含训练集（`train`） * **语言**: 英语 * **数据格式**: Apache Arrow格式（自动转换为Parquet格式） * **许可证**: Apache-2.0 * **标签**: `数学`, `代码`, `科学`, `推理`, `longcot` ## 数据集亮点 * 数据集设计旨在促进**长格式、逐步式推理**，非常适合训练与评估具备思维链（chain-of-thought, CoT）能力的模型。 * 推理轨迹涵盖针对简单与复杂问题的自然类人解释。 * 数据集覆盖STEM领域的数学应用题、逻辑推理题与技术类提示词，并完成了微调适配。 ## 数据集结构数据集中的每条样本包含以下字段： * **`problem`**（字符串类型）：数学、科学或编码类问题。 * **`solution`**（字符串类型）：采用长格式推理风格编写的详细分步解答。解答中的推理结构可帮助模型理解逻辑流程、中间步骤与层级演绎，因此该数据集非常适合需要可解释输出的高级大语言模型（Large Language Model, LLM）。 ## 数据集来源与衍生说明 Demeter-LongCoT-400K 是从以下数据集抽取的**随机种子子集**： * [Demeter-LongCoT-6M](https://huggingface.co/datasets/prithivMLmods/Demeter-LongCoT-6M)（约640万条样本）。 * 经筛选后保留约40万条样本，同时维持各领域的覆盖多样性。 * 数据集源自专为逻辑与数值推理任务定制的内部模块化数据集，其思维链风格的回复由基于QwQ 32B的模型生成。本数据集的构建目标是提升面向数学、科学与编码任务的大规模模型的思维链能力。 ## 许可证 Apache许可证2.0

提供机构：

maas

创建时间：

2025-08-24

5,000+

优质数据集

54 个

任务类型

进入经典数据集