KodCode-V1
Source: ModelScope community. Updated 2026-05-21; indexed 2025-03-08.
Download link: https://modelscope.cn/datasets/AI-ModelScope/KodCode-V1
# 🐱 KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
KodCode is the largest fully-synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning various domains (from algorithmic to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges). KodCode is designed for both supervised fine-tuning (SFT) and RL tuning.
- 🕸️ [Project Website](https://kodcode-ai.github.io/) - Discover the reasoning behind the name KodCode 🤨
- 📄 [Technical Report](https://arxiv.org/abs/2503.02951) - Discover the methodology and technical details behind KodCode
- 💾 [Github Repo](https://github.com/KodCode-AI/kodcode) - Access the complete pipeline used to produce KodCode V1
- 🤗 HF Datasets:
  - [KodCode-V1 (For RL)](https://huggingface.co/datasets/KodCode/KodCode-V1) [You are here!];
  - [KodCode-V1-SFT-R1 (for SFT)](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-R1);
  - [KodCode-V1-SFT-4o (for SFT)](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-4o).

✨ Update `v1.1`: Support a new `Online Judge` style!
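
Online Judge style items are checked through standard input and output rather than Pytest (see `style` under Data Formats). The following is a minimal sketch of what one such check might look like. It assumes the solution is a standalone Python script and that outputs are compared after stripping surrounding whitespace; both are assumptions for illustration, not details confirmed by this card. The helper name `run_stdio_case` is hypothetical.

```python
import subprocess
import sys


def run_stdio_case(solution_code, stdin_text, expected_stdout, timeout=5.0):
    """Run `solution_code` as a script, feed it `stdin_text`, and compare stdout.

    Comparison ignores leading/trailing whitespace (an assumption of this sketch).
    """
    result = subprocess.run(
        [sys.executable, "-c", solution_code],  # run with the current interpreter
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


# Hypothetical solution and test case, invented for this example.
example_solution = "a, b = map(int, input().split())\nprint(a + b)"
print(run_stdio_case(example_solution, "1 2\n", "3\n"))
```

When scoring model-generated code this way, the subprocess should run inside a sandbox, since the code being executed is untrusted.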
## 📊 Dataset Details
### Subsets
- Prefill (Simple Coding Questions, 43K)
- Leetcode (Coding Assessment Questions, 27K)
- Codeforces (Coding Assessment Questions, 33K)
- Apps (Coding Assessment Questions, 21K)
- Taco (Coding Assessment Questions, 81K)
- Code Contests (Coding Assessment Questions, 36K)
- Algorithm (DSA Knowledge, 31K)
- Data Structure (DSA Knowledge, 34K)
- Docs (Technical Documentation, 43K)
- Filter (Others, 77K)
- Package (Others, 7K)
- Evol (Others, 13K)
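
As a quick sanity check on the list above, the rounded subset sizes can be tallied per category. The numbers are copied from the list; because they are rounded to the nearest thousand, the totals are approximate, and this sketch makes no assumption about whether they include the `v1.1` Online Judge items.

```python
from collections import defaultdict

# (subset, category, approximate size in thousands), copied from the list above.
SUBSETS = [
    ("Prefill", "Simple Coding Questions", 43),
    ("Leetcode", "Coding Assessment Questions", 27),
    ("Codeforces", "Coding Assessment Questions", 33),
    ("Apps", "Coding Assessment Questions", 21),
    ("Taco", "Coding Assessment Questions", 81),
    ("Code Contests", "Coding Assessment Questions", 36),
    ("Algorithm", "DSA Knowledge", 31),
    ("Data Structure", "DSA Knowledge", 34),
    ("Docs", "Technical Documentation", 43),
    ("Filter", "Others", 77),
    ("Package", "Others", 7),
    ("Evol", "Others", 13),
]


def totals_by_category(subsets):
    """Sum the rounded sizes (in thousands) for each category."""
    totals = defaultdict(int)
    for _name, category, size_k in subsets:
        totals[category] += size_k
    return dict(totals)


by_category = totals_by_category(SUBSETS)
total_k = sum(by_category.values())
print(by_category)
print(f"~{total_k}K questions across {len(SUBSETS)} subsets")
```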
### Data Formats
- `version`: KodCode version. Currently there are `v1.0` and `v1.1`; the latter adds Online Judge style questions.
- `style`: Instruct / Complete / Online Judge. Instruct provides the question in natural language. Complete provides function signatures and test examples. Online Judge is converted from Instruct and uses `stdio` for input and output.
- `subset`: As mentioned above.
- `conversation_id`: Unique question identifier in KodCode.
- `question`: Synthesized coding question.
- `solution`: Verified implementation generated by `gpt-4o-0513`.
- `test`: Unit tests generated by `gpt-4o-0513`. Paired with `solution`. Tests for the Instruct and Complete styles are formatted for `Pytest`. Tests for Online Judge are formatted for `stdio`. You can convert the string to a dictionary via `ast.literal_eval(test)`.
- `test_info`: Contains the function name, parameter list, declaration, and docstring. If you are doing RL, we suggest including this information in the prompt.
- `gpt_pass_sequence`: We generate solution-test pairs up to 10 times. A value of 1 indicates the solution passed self-verification via unit tests on that trial, while 0 indicates failure.
- `gpt_pass_trial_num`: Number of trials that passed self-verification.
- `gpt_pass_percentage`: Percentage of passing trials relative to total trials.
- `gpt_difficulty`: Question difficulty level derived from `gpt_pass_percentage`.
- `trials`: Detailed information for each trial, including coverage statistics generated by `Pytest`.
- `metadata`: Contains seed information for internal debugging purposes.
- `benchmark_similarity`: Maximum cosine similarity between this question and all questions from HumanEval, MBPP, BigCodeBench, and LiveCodeBench.
- `filter_reason`: For questions labeled `use_with_caution`, explains why the question was filtered based on our pre-determined rules.
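
The relationships between several of the fields above can be sketched in a few lines. This is an illustrative sketch only: all values below are invented, the `stdin`/`stdout` keys of the parsed `test` dictionary and the key names inside `test_info` are assumptions, and whether `gpt_pass_percentage` is stored as a fraction or as a value out of 100 is not stated in this card. The helper names are hypothetical.

```python
import ast

# Hypothetical field values; the names follow the list above, the contents are invented.
gpt_pass_sequence = [1, 0, 1, 1]
oj_test_field = "{'stdin': ['1 2\\n', '5 7\\n'], 'stdout': ['3\\n', '12\\n']}"
question = "Write a function that returns the sum of two integers."
test_info = [
    {
        "function_name": "add",
        "parameter_list": "(a, b)",
        "function_declaration": "def add(a, b):",
        "docstring": "Return the sum of a and b.",
    }
]


def pass_stats(sequence):
    """Return (number of passing trials, fraction of trials that passed)."""
    passed = sum(sequence)
    fraction = passed / len(sequence) if sequence else 0.0
    return passed, fraction


def parse_test(test_field):
    """Convert a string-encoded `test` field into a Python object."""
    return ast.literal_eval(test_field)


def build_prompt(question_text, info_items):
    """Append the declarations and docstrings from `test_info` to the question."""
    lines = [question_text, "", "Your solution must define:"]
    for info in info_items:
        lines.append(f"{info['function_declaration']}  # {info['docstring']}")
    return "\n".join(lines)


passed, fraction = pass_stats(gpt_pass_sequence)
tests = parse_test(oj_test_field)
print(passed, fraction)
print(tests["stdin"], tests["stdout"])
print(build_prompt(question, test_info))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for decoding the string-encoded field.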
## 🧐 Other Information
**License**: Please follow [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en).
**Contact**: Please contact [Zhangchen](mailto:zxu9@uw.edu) by email.
## 📚 Citation
If you find the data or code useful, please cite:
```
@article{xu2025kodcode,
title={KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding},
author={Zhangchen Xu and Yang Liu and Yueqin Yin and Mingyuan Zhou and Radha Poovendran},
year={2025},
eprint={2503.02951},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2503.02951},
}
```
Provider: maas
Created: 2025-03-07
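Summary (collected and generated by 遇见数据集):
KodCode-V1 is a large, fully synthetic open-source coding dataset. It contains 12 distinct subsets covering multiple domains and difficulty levels and is suited to supervised fine-tuning and RL tuning. The dataset provides verifiable solutions and tests and supports multiple data formats and styles.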



