dschauhan08/mega-reasoning-mix-normalized-v3

Name: dschauhan08/mega-reasoning-mix-normalized-v3
Creator: dschauhan08
Published: 2026-04-03 18:07:24
License: no description provided

Hugging Face · Updated 2026-04-03 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/dschauhan08/mega-reasoning-mix-normalized-v3

Description:

---
license: apache-2.0
language:
- en
tags:
- reasoning
- chain-of-thought
- code-generation
- tool-use
- synthetic
- instruction-tuning
pretty_name: Mega Reasoning Mix Normalized v3
size_categories:
- 1M<n<10M
---

# 🧠 Mega Reasoning Mix Normalized v3

## Dataset Summary

The **Mega Reasoning Mix Normalized v3** is a highly curated, large-scale dataset designed for fine-tuning Large Language Models (LLMs) on complex reasoning, step-by-step Chain-of-Thought (CoT), advanced coding tasks, and agentic tool use.

This dataset is the result of combining over a dozen top-tier synthetic and filtered reasoning datasets. The entire corpus has been strictly normalized into a unified schema and rigorously deduplicated to prevent data contamination and ensure high-quality training signals.

## 🛠️ Processing Pipeline & Curation

To ensure the highest quality training data, this dataset was processed through a strict pipeline:

1. **Schema Normalization:** Data from highly diverse sources (standard prompts, conversational JSONL, raw terminal outputs) were mapped to a single, unified schema.
2. **Global Deduplication:** Every row was hashed (SHA-256) based on its core content (`input`, `output`, `reasoning`, `code`, `messages`, `tool_calls`). Exact duplicates across all combined datasets were dropped.
3. **Optimized Sharding:** The data is stored in partitioned `.parquet` shards (50k rows per train shard) for highly efficient, low-RAM streaming and processing during distributed training.

## 📊 Dataset Structure

### Data Splits

The dataset is split using a deterministic hash-based routing method to ensure no data leakage between splits.

* **`train`**: ~99% of the unique data.
* **`validation`**: ~1% of the unique data.

### Schema (Features)

Every row in the dataset contains the following normalized fields:

| Field Name | Type | Description |
| :--- | :--- | :--- |
| `source_dataset` | `string` | The original Hugging Face repository name. |
| `input` | `string` | The user's prompt, question, or instruction. |
| `output` | `string` | The target response, answer, or completion. |
| `reasoning` | `string` | The internal chain-of-thought, thinking process, or step-by-step rationale. |
| `code` | `string` | Any extracted source code, scripts, or programmatic solutions. |
| `messages` | `string` (JSON) | OpenAI-style conversational arrays `[{"role": "user", "content": "..."}]`. |
| `tool_calls` | `string` (JSON) | Extracted function calls or tool-use parameters. |
| `has_reasoning` | `bool` | Flag indicating if reasoning/CoT data is present. |
| `has_tool_calls` | `bool` | Flag indicating if tool utilization is present. |
| `has_code` | `bool` | Flag indicating if code blocks or logic are present. |
| `is_chat` | `bool` | Flag indicating if the row represents a multi-turn conversation. |
| `raw_original` | `string` (JSON) | A serialized dump of the original, unmodified row for auditing. |

## 📚 Source Datasets

This mix is composed of the following rigorously selected datasets:

**General & Advanced Reasoning:**

* `nohurry/Opus-4.6-Reasoning-3000x-filtered`
* `Crownelius/Opus-4.6-Reasoning-2100x-formatted`
* `dalisoft/claude-opus-4.6-high-reasoning-700x`
* `Roman1111111/claude-opus-4.6-10000x`
* `Roman1111111/gemini-3.1-pro-hard-high-reasoning`
* `Roman1111111/gpt-5.4-step-by-step-reasoning`
* `Crownelius/GLM-5.0-8000x-formatted-fixed`
* `Crownelius/Opus-4.5-3000x-formatted`
* `Crownelius/Gemini-3-Pro-Opus-4.5-Kimi-K2.5-13000x-formatted`
* `Magpie-Align/Magpie-Reasoning-150K`

**Coding & Tool Use / Agentic:**

* `ajibawa-2023/Code-Reasoning-Dataset`
* `Madras1/minimax-m2.5-code-distilled-14k`
* `AmanPriyanshu/tool-reasoning-sft-CODING-text_to_terminal_v2-sft-tool-use-agent-data-cleaned-rectified`
* `Danau5tin/terminal-tasks`

## 🎯 Intended Use

This dataset is highly recommended for:

* **SFT (Supervised Fine-Tuning):** Teaching base models to think before they speak (yielding `<think>` or `<reasoning>` blocks).
* **Agentic Frameworks:** Training models to navigate terminals, write code, and utilize external tools contextually.
* **Distillation:** Transferring the reasoning capabilities of massive proprietary models (Opus, GPT-4, Gemini Pro) into smaller, efficient local parameters.
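
The card describes SHA-256 deduplication over the core content fields and a deterministic hash-based train/validation split, but does not publish the exact code. The following is a minimal sketch of how such a step could look. The serialization format, the 100-bucket routing rule, and the helper names `content_hash`, `route_split`, and `dedupe_and_split` are assumptions made for illustration, not the author's implementation.

```python
import hashlib
import json

# Fields the card lists as the "core content" used for hashing.
CORE_FIELDS = ("input", "output", "reasoning", "code", "messages", "tool_calls")


def content_hash(row: dict) -> str:
    """SHA-256 over the core fields, serialized in a fixed order.

    Assumption: missing or None fields are treated as empty strings.
    """
    payload = json.dumps([row.get(f) or "" for f in CORE_FIELDS], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def route_split(digest: str, validation_percent: int = 1) -> str:
    """Assumed routing rule: map the hash to a bucket in 0..99 and send
    the lowest `validation_percent` buckets to validation."""
    bucket = int(digest, 16) % 100
    return "validation" if bucket < validation_percent else "train"


def dedupe_and_split(rows):
    """Drop exact duplicates by content hash, then route each unique row."""
    seen = set()
    splits = {"train": [], "validation": []}
    for row in rows:
        digest = content_hash(row)
        if digest in seen:
            continue  # exact duplicate of a row seen earlier, possibly from another source
        seen.add(digest)
        splits[route_split(digest)].append(row)
    return splits
```

Because the split is a pure function of the content hash, the same row always lands in the same split, and a duplicate can never appear on both sides. Note that `source_dataset` is deliberately left out of the hash in this sketch, so identical rows from two different source repositories count as duplicates.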
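
Since the schema stores `messages`, `tool_calls`, and `raw_original` as JSON-encoded strings rather than nested structures, a consumer will usually need to decode them before use. A minimal sketch follows, assuming that empty or malformed values should be treated as missing; the helper name `decode_row` is hypothetical.

```python
import json

# Columns the schema table marks as `string` (JSON).
JSON_FIELDS = ("messages", "tool_calls", "raw_original")


def decode_row(row: dict) -> dict:
    """Return a copy of `row` with the JSON string columns parsed.

    Assumption: empty strings, None, and malformed JSON all become None.
    """
    decoded = dict(row)
    for field in JSON_FIELDS:
        value = row.get(field)
        if isinstance(value, str) and value.strip():
            try:
                decoded[field] = json.loads(value)
            except json.JSONDecodeError:
                decoded[field] = None
        else:
            decoded[field] = None
    return decoded
```

Rows could come from any reader of the `.parquet` shards, for example the Hugging Face `datasets` library with `load_dataset(..., streaming=True)`; the boolean flags such as `is_chat` and `has_tool_calls` can then be used to decide which decoded fields are worth inspecting.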

Provider:

dschauhan08
