dschauhan08/mega-reasoning-mix-normalized-v3
Hugging Face, updated 2026-04-03, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/dschauhan08/mega-reasoning-mix-normalized-v3
Resource description:
---
license: apache-2.0
language:
- en
tags:
- reasoning
- chain-of-thought
- code-generation
- tool-use
- synthetic
- instruction-tuning
pretty_name: Mega Reasoning Mix Normalized v3
size_categories:
- 1M<n<10M
---
# 🧠 Mega Reasoning Mix Normalized v3
## Dataset Summary
**Mega Reasoning Mix Normalized v3** is a curated, large-scale dataset for fine-tuning Large Language Models (LLMs) on complex reasoning, step-by-step Chain-of-Thought (CoT), advanced coding tasks, and agentic tool use.
It combines 14 synthetic and filtered reasoning datasets, listed under Source Datasets. The entire corpus has been normalized into a unified schema and deduplicated by exact content hash to reduce data contamination and keep the training signal clean.
## 🛠️ Processing Pipeline & Curation
The dataset was processed through the following pipeline:
1. **Schema Normalization:** Data from highly diverse sources (standard prompts, conversational JSONL, raw terminal outputs) were mapped to a single, unified schema.
2. **Global Deduplication:** Every row was hashed (SHA-256) based on its core content (`input`, `output`, `reasoning`, `code`, `messages`, `tool_calls`). Exact duplicates across all combined datasets were dropped.
3. **Optimized Sharding:** The data is stored in partitioned `.parquet` shards (50k rows per train shard) for highly efficient, low-RAM streaming and processing during distributed training.
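The deduplication step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the card names the hashed fields and the hash function, but how the fields are serialized before hashing is not stated, so the JSON serialization and the treatment of missing fields below are assumptions.

```python
import hashlib
import json

# Fields the card lists as the "core content" used for hashing.
CORE_FIELDS = ("input", "output", "reasoning", "code", "messages", "tool_calls")


def content_hash(row: dict) -> str:
    """Return a SHA-256 hex digest over the core fields only."""
    # Assumption: missing or None fields are treated as empty strings.
    core = {name: row.get(name) or "" for name in CORE_FIELDS}
    # Assumption: fields are serialized as key-sorted JSON before hashing.
    payload = json.dumps(core, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(rows):
    """Yield the first occurrence of each distinct core content."""
    seen = set()
    for row in rows:
        digest = content_hash(row)
        if digest not in seen:
            seen.add(digest)
            yield row


# Hypothetical rows: the first two differ only in `source_dataset`,
# which is not part of the hashed content.
rows = [
    {"source_dataset": "a/one", "input": "2+2?", "output": "4"},
    {"source_dataset": "b/two", "input": "2+2?", "output": "4"},
    {"source_dataset": "a/one", "input": "2+3?", "output": "5"},
]
unique = list(deduplicate(rows))
```

Because `source_dataset` is excluded from the hash, the same example contributed by two source repositories collapses to a single row.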
## 📊 Dataset Structure
### Data Splits
The dataset is split using a deterministic hash-based routing method to ensure no data leakage between splits.
* **`train`**: ~99% of the unique data.
* **`validation`**: ~1% of the unique data.
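One way such deterministic routing could work is to reduce a row's content hash to a bucket number and send a fixed set of buckets to `validation`. The sketch below assumes 100 buckets with one reserved for validation; the card does not state the actual bucketing rule, and `assign_split` is a hypothetical helper.

```python
import hashlib

TOTAL_BUCKETS = 100      # assumption: not specified in the card
VALIDATION_BUCKETS = 1   # assumption: 1 of 100 buckets, roughly 1% of rows


def assign_split(content_digest: str) -> str:
    """Map a hex digest to 'train' or 'validation' deterministically."""
    bucket = int(content_digest, 16) % TOTAL_BUCKETS
    return "validation" if bucket < VALIDATION_BUCKETS else "train"


# Synthetic digests standing in for per-row content hashes.
digests = [
    hashlib.sha256(f"row-{i}".encode("utf-8")).hexdigest() for i in range(10_000)
]
splits = [assign_split(d) for d in digests]
validation_share = splits.count("validation") / len(splits)
```

Since the split depends only on the content hash, re-running the pipeline assigns every row to the same split, and identical content can never land in both.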
### Schema (Features)
Every row in the dataset contains the following normalized fields:
| Field Name | Type | Description |
| :--- | :--- | :--- |
| `source_dataset` | `string` | The original Hugging Face repository name. |
| `input` | `string` | The user's prompt, question, or instruction. |
| `output` | `string` | The target response, answer, or completion. |
| `reasoning` | `string` | The internal chain-of-thought, thinking process, or step-by-step rationale. |
| `code` | `string` | Any extracted source code, scripts, or programmatic solutions. |
| `messages` | `string` (JSON) | OpenAI-style conversational arrays `[{"role": "user", "content": "..."}]`. |
| `tool_calls` | `string` (JSON) | Extracted function calls or tool-use parameters. |
| `has_reasoning` | `bool` | Flag indicating if reasoning/CoT data is present. |
| `has_tool_calls` | `bool` | Flag indicating if tool utilization is present. |
| `has_code` | `bool` | Flag indicating if code blocks or logic are present. |
| `is_chat` | `bool` | Flag indicating if the row represents a multi-turn conversation. |
| `raw_original` | `string` (JSON) | A serialized dump of the original, unmodified row for auditing. |
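Because `messages`, `tool_calls`, and `raw_original` are stored as JSON strings, they need to be decoded before use. The following is a minimal sketch on a hypothetical row; it assumes that empty fields are stored as an empty string or `None`, which the card does not specify.

```python
import json

# Fields the schema table marks as `string` (JSON).
JSON_FIELDS = ("messages", "tool_calls", "raw_original")


def decode_row(row: dict) -> dict:
    """Return a copy of the row with JSON-encoded fields parsed."""
    decoded = dict(row)
    for name in JSON_FIELDS:
        value = row.get(name)
        # Assumption: empty or missing JSON fields are "" or None.
        decoded[name] = json.loads(value) if value else None
    return decoded


# Hypothetical row, not taken from the dataset.
sample_row = {
    "source_dataset": "example/source",
    "input": "What is 2 + 2?",
    "output": "4",
    "reasoning": "Add the two numbers.",
    "code": "",
    "messages": json.dumps([{"role": "user", "content": "What is 2 + 2?"}]),
    "tool_calls": "",
    "has_reasoning": True,
    "has_tool_calls": False,
    "has_code": False,
    "is_chat": False,
    "raw_original": json.dumps({"question": "What is 2 + 2?", "answer": "4"}),
}
decoded = decode_row(sample_row)
```

Rows streamed with `datasets.load_dataset("dschauhan08/mega-reasoning-mix-normalized-v3", split="train", streaming=True)` could be passed through `decode_row` in the same way, assuming the repository loads with its default configuration.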
## 📚 Source Datasets
This mix is composed of the following rigorously selected datasets:
**General & Advanced Reasoning:**
* `nohurry/Opus-4.6-Reasoning-3000x-filtered`
* `Crownelius/Opus-4.6-Reasoning-2100x-formatted`
* `dalisoft/claude-opus-4.6-high-reasoning-700x`
* `Roman1111111/claude-opus-4.6-10000x`
* `Roman1111111/gemini-3.1-pro-hard-high-reasoning`
* `Roman1111111/gpt-5.4-step-by-step-reasoning`
* `Crownelius/GLM-5.0-8000x-formatted-fixed`
* `Crownelius/Opus-4.5-3000x-formatted`
* `Crownelius/Gemini-3-Pro-Opus-4.5-Kimi-K2.5-13000x-formatted`
* `Magpie-Align/Magpie-Reasoning-150K`
**Coding & Tool Use / Agentic:**
* `ajibawa-2023/Code-Reasoning-Dataset`
* `Madras1/minimax-m2.5-code-distilled-14k`
* `AmanPriyanshu/tool-reasoning-sft-CODING-text_to_terminal_v2-sft-tool-use-agent-data-cleaned-rectified`
* `Danau5tin/terminal-tasks`
## 🎯 Intended Use
This dataset is highly recommended for:
* **SFT (Supervised Fine-Tuning):** Teaching base models to think before they speak (yielding `<think>` or `<reasoning>` blocks).
* **Agentic Frameworks:** Training models to navigate terminals, write code, and utilize external tools contextually.
* **Distillation:** Transferring the reasoning capabilities of massive proprietary models (Opus, GPT-4, Gemini Pro) into smaller, more efficient local models.
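For the SFT use case, a row's `reasoning` and `output` fields could be combined into a single completion target with the reasoning wrapped in a tag. This is a sketch under assumptions: the tag name, the newline layout, and the `format_sft_target` helper are illustrative choices, not a format prescribed by the dataset.

```python
def format_sft_target(row: dict, tag: str = "think") -> str:
    """Build a completion string: optional reasoning block, then the answer."""
    reasoning = (row.get("reasoning") or "").strip()
    output = (row.get("output") or "").strip()
    # Only emit the block when the row is flagged and actually has reasoning.
    if row.get("has_reasoning") and reasoning:
        return f"<{tag}>\n{reasoning}\n</{tag}>\n{output}"
    return output


# Hypothetical row, not taken from the dataset.
example = {
    "input": "What is 2 + 2?",
    "reasoning": "Add the two numbers.",
    "output": "4",
    "has_reasoning": True,
}
target = format_sft_target(example)
```

Passing `tag="reasoning"` would produce `<reasoning>` blocks instead; rows without reasoning fall back to the plain `output`.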
Provider:
dschauhan08



