Madras1/minimax-m2.5-code-distilled-14k
收藏Hugging Face2026-03-24 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/Madras1/minimax-m2.5-code-distilled-14k
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- en
license: apache-2.0
size_categories:
- 10K<n<100K
task_categories:
- text-generation
tags:
- code
- code-generation
- distillation
- synthetic
- reasoning
- chain-of-thought
- python
- minimax
pretty_name: "MiniMax M2.5 Code Distillation"
dataset_info:
features:
- name: id
dtype: int64
- name: problem
dtype: string
- name: function_name
dtype: string
- name: reasoning
dtype: string
- name: code
dtype: string
- name: model
dtype: string
splits:
- name: train
num_examples: 14199
---
# MiniMax M2.5 Code Distillation Dataset
A synthetic code generation dataset created by distilling **MiniMax-M2.5. Each example contains a Python coding problem, the model's chain-of-thought reasoning, and a **verified correct** solution that passes automated test execution.
## Key Features
- **Execution-verified**: Every solution was executed against test cases in a sandboxed subprocess. Only solutions that **passed all tests** are included.
- **Chain-of-thought reasoning**: Each example includes the model's reasoning process (`reasoning` field), useful for training reasoning-capable models.
- **Seed-generated problems**: Problems were generated using code snippets from real-world codeparrot/starcoder corpora as seeds, ensuring diversity and practical relevance.
- **Deduplicated**: Exact dedup (identical code/prompt via MD5) + near-dedup (MinHash LSH, Jaccard > 0.7 on word 5-grams).
- **Pure Python**: All solutions use only Python standard library — no external dependencies.
## Dataset Statistics
| Metric | Value |
|---|---|
| Total examples | 14,199 |
| Examples with reasoning | 14,199 (100.0%) |
| Avg code length | 1309 chars |
| Avg reasoning length | 2967 chars |
| Teacher model | `MiniMaxAI/MiniMax-M2.5` |
| Validation method | Subprocess execution with timeout |
## Schema
| Field | Type | Description |
|---|---|---|
| `id` | int | Sequential index |
| `problem` | string | The coding problem description (cleaned, no boilerplate) |
| `function_name` | string | Name of the function to implement |
| `reasoning` | string | Model's chain-of-thought before coding |
| `code` | string | Python solution that passed all tests |
| `model` | string | Teacher model name |
## Generation Pipeline
1. **Problem Generation**: A teacher model generates unique coding problems with function signatures, docstrings, and test assertions. Problems are seeded from real code corpora (CodeParrot, StarCoder) for diversity.
2. **Solution Generation**: The same teacher model solves each problem, producing both a chain-of-thought reasoning trace and a Python solution.
3. **Execution Validation**: Each solution is executed in an isolated subprocess with test assertions and a timeout. Only solutions that pass ALL tests are included.
4. **Filtering**: Solutions with `is_correct != True`, missing code, or execution errors are discarded.
5. **Deduplication**: Exact dedup (identical code/prompt via MD5 hash) + near-dedup (prompt Jaccard similarity > 0.7 on word 5-grams via MinHash LSH). For duplicates, the version with the longest reasoning is kept.
## Usage
```python
from datasets import load_dataset
ds = load_dataset("Madras1/minimax-m2.5-code-distilled-14k")
# Access an example
example = ds["train"][0]
print(example["problem"])
print(example["function_name"])
print(example["reasoning"])
print(example["code"])
```
### For SFT Training
```python
# Format as instruction-following
def format_for_sft(example):
return {
"instruction": example["problem"],
"output": example["code"],
"reasoning": example["reasoning"],
}
ds_sft = ds["train"].map(format_for_sft)
```
## License
Apache 2.0
## Citation
```bibtex
@dataset{minimax_m25_code_distilled,
title={MiniMax M2.5 Code Distillation Dataset},
author={Madras1},
year={2026},
url={https://huggingface.co/datasets/Madras1/minimax-m2.5-code-distilled-14k},
note={Synthetic code dataset with execution-verified solutions}
}
```
提供机构:
Madras1



