
KodCode-V1

ModelScope community · Updated 2026-05-21 · Indexed 2025-03-08
Download link:
https://modelscope.cn/datasets/AI-ModelScope/KodCode-V1
Resource description:
# 🐱 KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

KodCode is the largest fully-synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning various domains (from algorithmic to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges). KodCode is designed for both supervised fine-tuning (SFT) and RL tuning.

- 🕸️ [Project Website](https://kodcode-ai.github.io/) - Learn the reasoning behind the name KodCode 🤨
- 📄 [Technical Report](https://arxiv.org/abs/2503.02951) - Discover the methodology and technical details behind KodCode
- 💾 [Github Repo](https://github.com/KodCode-AI/kodcode) - Access the complete pipeline used to produce KodCode V1
- 🤗 HF Datasets:
  - [KodCode-V1 (For RL)](https://huggingface.co/datasets/KodCode/KodCode-V1) [You are here!]
  - [KodCode-V1-SFT-R1 (for SFT)](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-R1)
  - [KodCode-V1-SFT-4o (for SFT)](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-4o)

![KodCode](https://kodcode-ai.github.io/static/images/kodcode-pipeline.jpg)

✨ Update `v1.1`: Support a new `Online Judge` style!

## 📊 Dataset Details

### Subsets

- Prefill (Simple Coding Questions, 43K)
- Leetcode (Coding Assessment Questions, 27K)
- Codeforces (Coding Assessment Questions, 33K)
- Apps (Coding Assessment Questions, 21K)
- Taco (Coding Assessment Questions, 81K)
- Code Contests (Coding Assessment Questions, 36K)
- Algorithm (DSA Knowledge, 31K)
- Data Structure (DSA Knowledge, 34K)
- Docs (Technical Documentations, 43K)
- Filter (Others, 77K)
- Package (Others, 7K)
- Evol (Others, 13K)

### Data Formats

- `version`: KodCode version. Currently there are `v1.0` and `v1.1`, the latter adding Online Judge style questions.
- `style`: Instruct / Complete / Online Judge. Instruct provides the question in natural language. Complete provides function signatures and test examples. Online Judge is converted from Instruct and uses `stdio`.
- `subset`: One of the subsets listed above.
- `conversation_id`: Unique question identifier in KodCode.
- `question`: Synthesized coding question.
- `solution`: Verified implementation generated by `gpt-4o-0513`.
- `test`: Unit tests generated by `gpt-4o-0513`, paired with `solution`. Tests for the Instruct and Complete styles are formatted in `Pytest`. Tests for Online Judge are formatted in `stdio`; you can transform the string into a dictionary via `ast.literal_eval(test)`.
- `test_info`: Contains the function name, parameter list, declaration, and docstring. If you are doing RL, we suggest including this information in the prompt.
- `gpt_pass_sequence`: Solution-test pairs are generated up to 10 times. A value of 1 indicates the solution passed self-verification via unit tests on that trial, while 0 indicates failure.
- `gpt_pass_trial_num`: Number of trials that passed self-verification.
- `gpt_pass_percentage`: Percentage of passing trials relative to total trials.
- `gpt_difficulty`: Question difficulty level derived from `gpt_pass_percentage`.
- `trials`: Detailed information for each trial, including coverage statistics generated by `Pytest`.
- `metadata`: Contains seed information for internal debugging purposes.
- `benchmark_similarity`: Maximum cosine similarity between this question and all questions from HumanEval, MBPP, BigCodeBench, and LiveCodeBench.
- `filter_reason`: For questions labeled `use_with_caution`, explains why the question was filtered based on pre-determined rules.

## 🧐 Other Information

**License**: Please follow [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en).

**Contact**: Please contact [Zhangchen](mailto:zxu9@uw.edu) by email.
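
To illustrate the `test` and `gpt_pass_sequence` fields described under Data Formats, here is a minimal sketch using only the Python standard library. The sample record is hypothetical: the `stdin`/`stdout` keys inside the `test` string are an assumption about the Online Judge layout, and reporting the pass rate as a fraction between 0 and 1 is likewise an assumption, so check both against real rows before relying on them. The helper names `parse_stdio_test` and `pass_stats` are illustrative and not part of KodCode.

```python
import ast


def parse_stdio_test(test: str) -> dict:
    """Turn the string stored in `test` into a dictionary."""
    # ast.literal_eval only evaluates Python literals; it never executes code.
    parsed = ast.literal_eval(test)
    if not isinstance(parsed, dict):
        raise ValueError("expected the `test` string to encode a dictionary")
    return parsed


def pass_stats(gpt_pass_sequence: list) -> tuple:
    """Return (number of passing trials, fraction of passing trials)."""
    if not gpt_pass_sequence:
        raise ValueError("gpt_pass_sequence is empty")
    passed = sum(gpt_pass_sequence)  # each entry is 1 (pass) or 0 (fail)
    return passed, passed / len(gpt_pass_sequence)


# Hypothetical record: the key names inside `test` are assumed, not confirmed.
record = {
    "test": "{'stdin': ['1 2', '3 4'], 'stdout': ['3', '7']}",
    "gpt_pass_sequence": [1, 0, 1, 1],
}

cases = parse_stdio_test(record["test"])
passed, fraction = pass_stats(record["gpt_pass_sequence"])
print(len(cases["stdin"]), passed, fraction)  # prints: 2 3 0.75
```

`ast.literal_eval` is used here rather than `eval` because the source recommends it and because it refuses anything other than literals, which matters when parsing strings from a downloaded dataset.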

## 📚 Citation

If you find the data or code useful, please cite:

```
@article{xu2025kodcode,
  title={KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding},
  author={Zhangchen Xu and Yang Liu and Yueqin Yin and Mingyuan Zhou and Radha Poovendran},
  year={2025},
  eprint={2503.02951},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2503.02951},
}
```

Provider: maas
Created: 2025-03-07
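Background overview
KodCode-V1 is a large, fully synthetic, open-source coding dataset. It contains 12 distinct subsets covering multiple domains and difficulty levels, and is suited to both supervised fine-tuning and reinforcement learning tuning. The dataset provides verifiable solutions and tests, and supports several data formats and question styles.
The summary above was collected and generated by 遇见数据集.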