AmanPriyanshu/reasoning-sft-NextCoderDataset-100K
收藏Hugging Face2026-03-03 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/AmanPriyanshu/reasoning-sft-NextCoderDataset-100K
下载链接
链接失效反馈官方服务:
资源简介:
---
license: mit
task_categories:
- text-generation
language:
- en
- code
tags:
- reasoning
- sft
- chain-of-thought
- code-editing
- code
size_categories:
- 100K<n<1M
---
# NextCoderDataset (converted)
Converted version of [microsoft/NextCoderDataset](https://huggingface.co/datasets/microsoft/NextCoderDataset), subsampled to 100,000 rows equally distributed across 8 programming languages for reasoning SFT training.
## Format
Each row has three columns:
- **`input`** - list of dicts with system and user messages (system prompt sets expert code editor role, user prompt contains the editing instruction and original code)
- **`response`** - response string with `<think>` reasoning block followed by the edited code in markdown code blocks
- **`source`** - programming language (cpp, c, rust, java, javascript, python, go, kotlin)
## Language Distribution
| Language | Rows |
|----------|------|
| c | 12,500 |
| cpp | 12,500 |
| go | 12,500 |
| java | 12,500 |
| javascript | 12,500 |
| kotlin | 12,500 |
| python | 12,500 |
| rust | 12,500 |
## Conversion
- Subsampled 12,500 rows per language from the original 381K dataset
- Added system prompt with expert code editor role
- Injected generic code-editing reasoning sequences in think blocks
- Response format: think block then edited code
## License
MIT
## Credits
Original dataset: [microsoft/NextCoderDataset](https://huggingface.co/datasets/microsoft/NextCoderDataset) - NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits (ICML 2025)
提供机构:
AmanPriyanshu



