
Madras1/minimax-m2.5-code-distilled-14k

Hugging Face | Updated 2026-03-24 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Madras1/minimax-m2.5-code-distilled-14k
Resource description:
---
language:
- en
license: apache-2.0
size_categories:
- 10K<n<100K
task_categories:
- text-generation
tags:
- code
- code-generation
- distillation
- synthetic
- reasoning
- chain-of-thought
- python
- minimax
pretty_name: "MiniMax M2.5 Code Distillation"
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: problem
    dtype: string
  - name: function_name
    dtype: string
  - name: reasoning
    dtype: string
  - name: code
    dtype: string
  - name: model
    dtype: string
  splits:
  - name: train
    num_examples: 14199
---

# MiniMax M2.5 Code Distillation Dataset

A synthetic code generation dataset created by distilling **MiniMax-M2.5**. Each example contains a Python coding problem, the model's chain-of-thought reasoning, and a **verified correct** solution that passes automated test execution.

## Key Features

- **Execution-verified**: Every solution was executed against test cases in a sandboxed subprocess. Only solutions that **passed all tests** are included.
- **Chain-of-thought reasoning**: Each example includes the model's reasoning process (`reasoning` field), useful for training reasoning-capable models.
- **Seed-generated problems**: Problems were generated using code snippets from the real-world CodeParrot and StarCoder corpora as seeds, for diversity and practical relevance.
- **Deduplicated**: Exact dedup (identical code/prompt via MD5) + near-dedup (MinHash LSH, Jaccard > 0.7 on word 5-grams).
- **Pure Python**: All solutions use only the Python standard library, with no external dependencies.
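The deduplication rule in the list above can be sketched in plain Python. This is a minimal illustration, not the author's pipeline: it computes exact Jaccard similarity pairwise instead of using MinHash LSH (which approximates the same quantity at scale), and the function names, the choice to hash raw `problem` and `code` strings separately, and the whitespace tokenization are all assumptions. Keeping the example with the longest reasoning follows the rule the card gives for duplicates.

```python
import hashlib


def md5_key(text: str) -> str:
    """Return the MD5 hex digest of a string (used for exact-duplicate checks)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def word_shingles(text: str, n: int = 5) -> set:
    """Return the set of word n-grams in a text (5-grams by default)."""
    words = text.split()  # assumption: simple whitespace tokenization
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(examples, threshold: float = 0.7):
    """Drop exact and near duplicates, keeping the longest reasoning per group."""
    # Visit long reasoning traces first so the survivor of each group is the longest.
    ordered = sorted(examples, key=lambda ex: len(ex["reasoning"]), reverse=True)
    seen_problems, seen_codes = set(), set()
    kept, kept_shingles = [], []
    for ex in ordered:
        problem_key, code_key = md5_key(ex["problem"]), md5_key(ex["code"])
        if problem_key in seen_problems or code_key in seen_codes:
            continue  # exact duplicate of a kept prompt or solution
        shingles = word_shingles(ex["problem"])
        if any(jaccard(shingles, other) > threshold for other in kept_shingles):
            continue  # near duplicate of a kept prompt
        seen_problems.add(problem_key)
        seen_codes.add(code_key)
        kept.append(ex)
        kept_shingles.append(shingles)
    return kept
```

The pairwise comparison is quadratic in the number of kept examples, which is fine for a demonstration but is the reason a real pipeline at this scale would reach for MinHash LSH.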

## Dataset Statistics

| Metric | Value |
|---|---|
| Total examples | 14,199 |
| Examples with reasoning | 14,199 (100.0%) |
| Avg code length | 1309 chars |
| Avg reasoning length | 2967 chars |
| Teacher model | `MiniMaxAI/MiniMax-M2.5` |
| Validation method | Subprocess execution with timeout |

## Schema

| Field | Type | Description |
|---|---|---|
| `id` | int | Sequential index |
| `problem` | string | The coding problem description (cleaned, no boilerplate) |
| `function_name` | string | Name of the function to implement |
| `reasoning` | string | Model's chain-of-thought before coding |
| `code` | string | Python solution that passed all tests |
| `model` | string | Teacher model name |

## Generation Pipeline

1. **Problem Generation**: A teacher model generates unique coding problems with function signatures, docstrings, and test assertions. Problems are seeded from real code corpora (CodeParrot, StarCoder) for diversity.
2. **Solution Generation**: The same teacher model solves each problem, producing both a chain-of-thought reasoning trace and a Python solution.
3. **Execution Validation**: Each solution is executed in an isolated subprocess with test assertions and a timeout. Only solutions that pass ALL tests are included.
4. **Filtering**: Solutions with `is_correct != True`, missing code, or execution errors are discarded.
5. **Deduplication**: Exact dedup (identical code/prompt via MD5 hash) + near-dedup (prompt Jaccard similarity > 0.7 on word 5-grams via MinHash LSH). For duplicates, the version with the longest reasoning is kept.
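Step 3 of the pipeline above (execution validation) can be illustrated with a short sketch. It assumes the test assertions are available as a string of `assert` statements; the published schema does not include them, so `tests` is a hypothetical input, and `passes_tests` and its 10-second default timeout are illustrative choices rather than the author's actual settings. A separate process with a timeout guards against crashes and infinite loops, but it is not a security sandbox on its own.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(code: str, tests: str, timeout_s: float = 10.0) -> bool:
    """Run a solution followed by its assert statements in a fresh interpreter.

    Returns True only if the child process exits with status 0 before the timeout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        # The solution comes first so the assertions can call the function it defines.
        script.write_text(code + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp,              # keep any files the solution writes in the temp dir
                capture_output=True,  # swallow stdout/stderr from the candidate
                timeout=timeout_s,    # the child is killed if it runs too long
            )
        except subprocess.TimeoutExpired:
            return False
    # A failed assert or any uncaught exception yields a non-zero exit code.
    return result.returncode == 0
```

For example, `passes_tests("def add(a, b):\n    return a + b", "assert add(2, 3) == 5")` runs the candidate and its assertion in a child interpreter, and a solution that loops forever is rejected once the timeout expires.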

## Usage

```python
from datasets import load_dataset

ds = load_dataset("Madras1/minimax-m2.5-code-distilled-14k")

# Access an example
example = ds["train"][0]
print(example["problem"])
print(example["function_name"])
print(example["reasoning"])
print(example["code"])
```

### For SFT Training

```python
# Format as instruction-following
def format_for_sft(example):
    return {
        "instruction": example["problem"],
        "output": example["code"],
        "reasoning": example["reasoning"],
    }

ds_sft = ds["train"].map(format_for_sft)
```

## License

Apache 2.0

## Citation

```bibtex
@dataset{minimax_m25_code_distilled,
  title={MiniMax M2.5 Code Distillation Dataset},
  author={Madras1},
  year={2026},
  url={https://huggingface.co/datasets/Madras1/minimax-m2.5-code-distilled-14k},
  note={Synthetic code dataset with execution-verified solutions}
}
```
Provider:
Madras1