Qwen3-Coder-Next-OpenCode-SFT
Source: ModelScope community; updated 2026-05-17; indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/zake7749/Qwen3-Coder-Next-OpenCode-SFT
## Overview
<div align="center">
<img src="chart3_difficulty_passrate.png" alt="Difficulty vs Pass Rate" width="80%" />
</div>
This dataset contains high-quality code reasoning data for training language models on competitive programming tasks. It is produced via rejection sampling with `Qwen3-Coder-Next`: the model generates multiple candidate solutions per problem, each candidate is executed against test cases in a sandboxed environment, and the results are used to build two complementary training datasets:
- **SFT dataset** (49,374 examples) — candidates that pass 100% of test cases
- **Preference dataset** (10,920 pairs) — DPO pairs of (chosen=passed, rejected=failed) with fine-grained rejection type labels
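The split described above can be sketched in a few lines of Python. This is an illustration only, not the authors' pipeline: the `Candidate` class and `split_candidates` function are hypothetical names, and pairing every passing candidate with every failing one is an assumption (the published pair count suggests the real pipeline samples or caps pairs per problem).

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Candidate:
    """One sampled response for a problem (hypothetical container)."""
    problem_id: str
    index: int
    response: str
    tests_passed: int
    tests_total: int

    @property
    def passed(self) -> bool:
        # A candidate counts as passing only if it clears every test case.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def split_candidates(candidates):
    """Return (sft_rows, dpo_pairs) for the candidates of ONE problem."""
    passed = [c for c in candidates if c.passed]
    failed = [c for c in candidates if not c.passed]

    # SFT rows: one per fully passing candidate, using the documented id format.
    sft_rows = [
        {"id": f"{c.problem_id}_sft_{c.index}", "response": c.response}
        for c in passed
    ]

    # Assumption: every (passed, failed) combination becomes a preference pair.
    dpo_pairs = [
        {"chosen": p.response, "rejected": f.response}
        for p, f in product(passed, failed)
    ]
    return sft_rows, dpo_pairs
```

A problem whose candidates all pass yields only SFT rows, one whose candidates all fail yields nothing, and a mixed problem yields both.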
---
## Dataset Statistics
### Summary
| Dataset | Examples | Format |
|---------|----------|--------|
| SFT | 49,374 | JSONL + Parquet |
| Preference (DPO) | 10,920 pairs | JSONL + Parquet |
### Rejection Type Distribution (Preference Dataset)
<img src="chart2_rejection_types.png" alt="Rejection Type Distribution" width="80%" />
| Rejection Type | Count | % | Description |
| --- | --- | --- | --- |
| `wa` | 4,371 | 40.0% | Wrong answer — code ran, produced output, but incorrect |
| `no_stdin` | 2,371 | 21.7% | Code doesn't read from stdin (can't feed test input) |
| `empty_output` | 1,356 | 12.4% | Code ran without error but produced no output |
| `syntax_error` | 868 | 7.9% | SyntaxError / IndentationError (code extraction artifact) |
| `no_code` | 708 | 6.5% | No code block found in LLM response |
| `logic_error` | 645 | 5.9% | Real bug — IndexError, ValueError, RecursionError, etc. |
| `runtime_error` | 392 | 3.6% | Other runtime crash (unclassified from stderr) |
| `timeout` | 154 | 1.4% | Exceeded 15-second time limit |
| `memory` | 55 | 0.5% | Exceeded 512 MB memory limit |
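To make some of these labels concrete, here is a rough sketch of how a single candidate could be run against one test case and mapped to a label. It is not the classifier used to build the dataset, whose rules are not documented here. The function name `run_candidate` and the stderr matching rules are assumptions, a plain child process is not a sandbox, and the `no_code`, `no_stdin`, and `memory` labels are not covered.

```python
import subprocess
import sys


def run_candidate(code: str, stdin_text: str, expected: str,
                  timeout_s: float = 15.0) -> str:
    """Run `code` in a child Python process; return 'pass' or a rejection label."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # mirrors the 15-second limit in the table
        )
    except subprocess.TimeoutExpired:
        return "timeout"

    if proc.returncode != 0:
        # Assumed heuristic: classify the crash by exception name in stderr.
        if "SyntaxError" in proc.stderr or "IndentationError" in proc.stderr:
            return "syntax_error"
        if any(name in proc.stderr
               for name in ("IndexError", "ValueError", "RecursionError")):
            return "logic_error"
        return "runtime_error"

    if not proc.stdout.strip():
        return "empty_output"

    # Compare whitespace-separated tokens so trailing newlines do not matter.
    return "pass" if proc.stdout.split() == expected.split() else "wa"
```

For example, `run_candidate("print(int(input()) * 2)", "21\n", "42")` returns `"pass"`, while the same call with `+ 2` in place of `* 2` returns `"wa"`.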
### Problem-level Statistics
| Category | Count |
|----------|-------|
| Problems with candidates (after generation) | 12,900 |
| Problems with >= 1 passing candidate | 8,721 |
| Problems where all candidates pass (SFT only) | 3,467 |
| Problems where all candidates fail (contribute no data) | 4,179 |
| Problems with mixed pass/fail (DPO pairs) | 5,254 |
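The three outcome categories partition the problem set, which the counts above confirm. The short sketch below shows the categorization rule and checks the arithmetic; the `categorize` helper is a hypothetical name introduced for illustration.

```python
def categorize(pass_flags):
    """Classify one problem from the pass/fail flags of its candidates."""
    if not pass_flags:
        raise ValueError("problem has no candidates")
    if all(pass_flags):
        return "all_pass"   # contributes SFT rows only
    if not any(pass_flags):
        return "all_fail"   # contributes nothing
    return "mixed"          # contributes SFT rows and DPO pairs


# Counts copied from the table above.
total, with_pass, all_pass, all_fail, mixed = 12_900, 8_721, 3_467, 4_179, 5_254

# Problems with at least one pass are either all-pass or mixed.
assert all_pass + mixed == with_pass
# Every problem either has a passing candidate or has none.
assert with_pass + all_fail == total
```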
### Pass Count Distribution per Problem
<img src="chart4_pass_histogram.png" alt="Pass Count Histogram" width="80%" />
### Token Length Statistics (Qwen3-Coder-Next tokenizer)
| | Passed | Failed |
|--|--------|--------|
| Median | 2,925 tokens | 9,257 tokens |
| Mean | 3,641 tokens | 7,749 tokens |
| P25 | 1,257 tokens | 6,588 tokens |
| P75 | 5,603 tokens | 9,488 tokens |
| Max | 9,866 tokens | 9,977 tokens |
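We set `max_tokens` to 10,000 because of resource constraints.
Failed candidates tend to be longer on average (mean 7,749 vs. 3,641 tokens), as shorter responses are more likely to contain a complete, correct solution. The failed median of 9,257 tokens sits close to the 10,000-token cap, which suggests that some failures may be responses cut off before finishing.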
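A summary of this shape can be reproduced from a list of per-response token counts with the standard library. The sketch below assumes the counts have already been computed with the tokenizer. The `length_summary` name is hypothetical, and the quantile method is a choice made here, since the card does not say which one was used.

```python
import statistics


def length_summary(lengths):
    """Summarize token counts with the same rows as the table above."""
    # quantiles(n=4) returns the three cut points: P25, median, P75.
    p25, median, p75 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "p25": p25,
        "median": median,
        "p75": p75,
        "mean": statistics.fmean(lengths),
        "max": max(lengths),
    }
```

Calling it once on the passed responses and once on the failed ones gives the two columns.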
### Difficulty Distribution (Codeforces subset)
Codeforces problems use Elo-style ratings. Pass rate decreases steadily with difficulty:
| Rating Range | Avg Pass Rate | # Candidates |
|--------------|---------------|--------------|
| 800 | 79% | ~5,000 |
| 1200 | 69% | ~2,000 |
| 1600 | 55% | ~2,300 |
| 2000 | 43% | ~2,500 |
| 2400 | 18% | ~2,100 |
| 3000 | 7% | ~750 |
| 3500 | 4% | ~450 |
DeepCoder problems (6,387) do not have numeric difficulty ratings and are labeled `"unknown"`.
<img src="chart5_source_difficulty.png" alt="Source x Difficulty Heatmap" width="80%" />
---
## Field Descriptions
### SFT Dataset
Each row is a single candidate solution that passed all test cases.
| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Unique identifier: `{problem_id}_sft_{candidate_index}` |
| `problem` | string | Full problem statement in English |
| `response` | string | Complete model response (reasoning + code) |
| `solution` | string | Python code extracted from the response |
| `test_pass_rate` | float | Fraction of test cases passed (always 1.0 for SFT) |
| `tests_passed` | int | Number of test cases passed |
| `tests_total` | int | Total number of test cases executed |
| `source` | string | Data source: `"deepcoder"` or `"codeforces"` |
| `difficulty` | string | Problem difficulty rating (stringified integer or `"unknown"`) |
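The schema above translates directly into a row check that can be run after loading the JSONL file. This is a minimal sketch using only the standard library. The `validate_sft_row` helper is a hypothetical name, and the constraints it enforces are only those stated in the table.

```python
import json

# Field names and types taken from the table above.
SFT_FIELDS = {
    "id": str,
    "problem": str,
    "response": str,
    "solution": str,
    "test_pass_rate": (int, float),  # JSON may store 1.0 as 1
    "tests_passed": int,
    "tests_total": int,
    "source": str,
    "difficulty": str,
}


def validate_sft_row(row: dict) -> list:
    """Return a list of problems found in one row; empty means it looks valid."""
    problems = [f"missing field: {k}" for k in SFT_FIELDS if k not in row]
    if problems:
        return problems
    problems = [
        f"wrong type for {k}"
        for k, t in SFT_FIELDS.items()
        if not isinstance(row[k], t)
    ]
    if problems:
        return problems
    if row["test_pass_rate"] != 1.0:
        problems.append("test_pass_rate must be 1.0 for SFT rows")
    if row["tests_passed"] != row["tests_total"]:
        problems.append("tests_passed must equal tests_total")
    if row["source"] not in {"deepcoder", "codeforces"}:
        problems.append("unknown source")
    if row["difficulty"] != "unknown" and not row["difficulty"].isdigit():
        problems.append("difficulty must be a stringified integer or 'unknown'")
    return problems


def validate_jsonl_line(line: str) -> list:
    """Parse one JSONL line and validate it."""
    return validate_sft_row(json.loads(line))
```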
---
## Limitations
- **Test coverage**: Problems have a limited number of test cases (varies by source). A candidate passing all provided tests does not guarantee correctness on all possible inputs.
- **Stdin-only I/O**: Only solutions that read from `stdin` and write to `stdout` are supported. Problems requiring file I/O or interactive protocols are excluded.
- **Single-language**: All generated solutions are in Python. Performance-sensitive problems (tight time limits designed for C++) may have higher timeout rates.
- **Difficulty bias**: Easier problems tend to have more passing candidates, leading to more SFT data but fewer DPO pairs. Harder problems contribute more DPO pairs.
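The stdin-only constraint can be screened for before execution. The following is a heuristic sketch, not the check used to assign the `no_stdin` label: it parses the code with `ast` and looks for the common ways a Python solution reads standard input. The `reads_stdin` name is hypothetical, and the heuristic can miss unusual patterns.

```python
import ast


def reads_stdin(code: str) -> bool:
    """Heuristically decide whether `code` reads from standard input."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False  # unparsable code cannot be analyzed

    for node in ast.walk(tree):
        # input(...) or a bare `stdin` imported via `from sys import stdin`
        if isinstance(node, ast.Name) and node.id in {"input", "stdin"}:
            return True
        # sys.stdin.read(), sys.stdin.readline(), iteration over sys.stdin
        if isinstance(node, ast.Attribute) and node.attr == "stdin":
            return True
        # open(0) reads file descriptor 0, which is stdin
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "open"
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and node.args[0].value == 0
        ):
            return True
    return False
```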
## License
Source data licenses should be verified before redistribution:
- **DeepCoder** (`agentica-org/DeepCoder-Preview-Dataset`): Check HuggingFace dataset card
- **Codeforces** (`open-r1/codeforces`): Check HuggingFace dataset card and Codeforces terms of use
## Citation
If you find our dataset useful in your work, please consider citing it as:
```
@misc{yang_opencode_qwen_2026,
title={Qwen3-Coder-Next-OpenCode-SFT},
author={Yang, Kai-Chou},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/zake7749/Qwen3-Coder-Next-OpenCode-SFT/}
}
@misc{deepcoder2025,
title={DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level},
author={Michael Luo and Sijun Tan and Roy Huang and Ameen Patel and Alpay Ariyak and Qingyang Wu and Xiaoxiang Shi and Rachel Xin and Colin Cai and Maurice Weber and Ce Zhang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
howpublished={\url{https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51}},
note={Notion Blog},
year={2025}
}
@misc{penedo2025codeforces,
title={CodeForces},
author={Guilherme Penedo and Anton Lozhkov and Hynek Kydlíček and Loubna Ben Allal and Edward Beeching and Agustín Piqueres Lajarín and Quentin Gallouédec and Nathan Habib and Lewis Tunstall and Leandro von Werra},
year={2025},
publisher = {Hugging Face},
journal = {Hugging Face repository},
howpublished = {\url{https://huggingface.co/datasets/open-r1/codeforces}}
}
```
Provider: maas
Created: 2026-03-14



