naimulislam/reasoning_code_advanced_1m

Name: naimulislam/reasoning_code_advanced_1m
Creator: naimulislam
Published: 2025-12-20 18:12:24
License: 暂无描述

Hugging Face2025-12-20 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/naimulislam/reasoning_code_advanced_1m

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: mit task_categories: - text-generation - question-answering language: - en tags: - reasoning - chain-of-thought - code - synthetic - programming - logic - cot size_categories: - 1M<n<10M pretty_name: Reasoning Code Advanced 1M --- # 💻 Reasoning Code Advanced 1M ## 📖 Dataset Summary **Reasoning Code Advanced 1M** is a massive-scale, synthetic dataset specifically engineered to improve the algorithmic reasoning and problem-solving capabilities of Large Language Models (LLMs). Featuring 1,000,000 unique coding samples, this dataset spans multiple programming languages (Python, JS, C++, etc.) and focuses on logic-heavy development tasks. A key feature of this dataset is its **Adaptive Reasoning Architecture**. Unlike standard instruction datasets, the Chain-of-Thought (CoT) traces are strictly correlated with difficulty levels and are encapsulated within `<thinking>` tags to train the model's "internal monologue" before it emits code. ## ⚙️ Dataset Structure Each record follows a structured JSON format optimized for fine-tuning: | Field | Type | Description | | :--- | :--- | :--- | | `serial_number` | int | Unique identifier for the sample. | | `difficulty` | str | Complexity level: **Easy**, **Medium**, or **Hard**. | | `question` | str | The coding prompt or technical challenge. | | `reasoning` | str | The internal thought process (CoT) wrapped in `<thinking>...</thinking>`. | | `final_answer` | str | The actual code block or solution. | ### Data Instance Example (Hard Difficulty) ```json { "serial_number": 88210, "difficulty": "Hard", "question": "Python: Write a function to check if a binary tree is height-balanced.", "reasoning": "<thinking>To check if a tree is balanced, I need to calculate the height of left and right subtrees for every node. A tree is balanced if the absolute difference between heights is ≤ 1. I should use a recursive DFS approach. To optimize, I can return -1 if a subtree is unbalanced to avoid redundant calculations.</thinking>", "final_answer": "def is_balanced(root):\n def check(node):\n if not node: return 0\n left = check(node.left)\n right = check(node.right)\n if left == -1 or right == -1 or abs(left - right) > 1: return -1\n return max(left, right) + 1\n return check(root) != -1" } ``` ## 🧠 Difficulty & Reasoning Logic This dataset is designed to teach models **when** to invoke deep reasoning. The `reasoning` field population follows a strict probability distribution: | Difficulty | Reasoning Presence | Training Objective | | :--- | :--- | :--- | | **Easy** | 0% (None) | Focuses on direct syntax recall and simple definitions without over-thinking. | | **Medium** | 50% | Teaches the model to evaluate if a task requires a plan or can be solved directly. | | **Hard** | 100% | Forces a full step-by-step logic trace before generating complex algorithms. | ## 💻 How to Use Load the dataset via the Hugging Face `datasets` library: ```python from datasets import load_dataset # Replace with your actual repository path dataset = load_dataset("naimulislam/reasoning_code_advanced_1m") # Accessing a Hard sample sample = dataset['train'][10] print(f"[{sample['difficulty']}] {sample['question']}\n{sample['final_answer']}") ``` ## 🛠️ Dataset Creation & Scope The dataset was synthetically generated using a logic engine designed to simulate real-world programming scenarios across: * **Data Structures:** Trees, Graphs, Linked Lists, and Hash Maps. * **Algorithms:** Sorting, Searching, Dynamic Programming, and Recursion. * **System Design:** Basic architecture patterns and class structures. * **Debugging:** Identifying and fixing logical errors in provided snippets. * **Optimization:** Converting O(n²) solutions into O(n log n). ## 📜 License This dataset is released under the **MIT License**. ## 🤝 Citation If you use this dataset in your research or LLM training, please credit: ```bibtex @dataset{reasoning_code_advanced_1m, author = {Naimul Islam Nahid}, title = {Reasoning Code Advanced 1M}, year = {2025}, publisher = {Hugging Face}, journal = {naimulislam/reasoning_code_advanced_1m}, } ```

提供机构：

naimulislam

5,000+

优质数据集

54 个

任务类型

进入经典数据集