KodCode-V1

Name: KodCode-V1
Creator: maas
Published: 2026-05-21 18:08:58
License: No description provided

ModelScope community. Updated: 2026-05-21. Indexed: 2025-03-08.

Download link:

https://modelscope.cn/datasets/AI-ModelScope/KodCode-V1

Description:

# 🐱 KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

KodCode is the largest fully-synthetic open-source dataset providing verifiable solutions and tests for coding tasks. It contains 12 distinct subsets spanning various domains (from algorithmic to package-specific knowledge) and difficulty levels (from basic coding exercises to interview and competitive programming challenges). KodCode is designed for both supervised fine-tuning (SFT) and RL tuning.

- 🕸️ [Project Website](https://kodcode-ai.github.io/) - Discover the reasoning behind the name KodCode 🤨
- 📄 [Technical Report](https://arxiv.org/abs/2503.02951) - Discover the methodology and technical details behind KodCode
- 💾 [Github Repo](https://github.com/KodCode-AI/kodcode) - Access the complete pipeline used to produce KodCode V1
- 🤗 HF Datasets:
  - [KodCode-V1 (For RL)](https://huggingface.co/datasets/KodCode/KodCode-V1) [You are here!]
  - [KodCode-V1-SFT-R1 (for SFT)](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-R1)
  - [KodCode-V1-SFT-4o (for SFT)](https://huggingface.co/datasets/KodCode/KodCode-V1-SFT-4o)

![KodCode](https://kodcode-ai.github.io/static/images/kodcode-pipeline.jpg)

✨ Update `v1.1`: Added support for a new `Online Judge` style!

## 📊 Dataset Details

### Subsets

- Prefill (Simple Coding Questions, 43K)
- Leetcode (Coding Assessment Questions, 27K)
- Codeforces (Coding Assessment Questions, 33K)
- Apps (Coding Assessment Questions, 21K)
- Taco (Coding Assessment Questions, 81K)
- Code Contests (Coding Assessment Questions, 36K)
- Algorithm (DSA Knowledge, 31K)
- Data Structure (DSA Knowledge, 34K)
- Docs (Technical Documentation, 43K)
- Filter (Others, 77K)
- Package (Others, 7K)
- Evol (Others, 13K)

### Data Formats

- `version`: KodCode version. Currently there are two: `v1.0`, and `v1.1`, which adds Online Judge style questions.
- `style`: Instruct / Complete / Online Judge. Instruct provides the question in natural language. Complete provides function signatures and test examples. Online Judge is converted from Instruct and uses `stdio`.
- `subset`: One of the subsets listed above.
- `conversation_id`: Unique question identifier in KodCode.
- `question`: Synthesized coding question.
- `solution`: Verified implementation generated by `gpt-4o-0513`.
- `test`: Unit tests generated by `gpt-4o-0513`, paired with `solution`. Tests for the Instruct and Complete styles are formatted in `Pytest`. Tests for Online Judge are formatted in `stdio`. You can convert the string to a dictionary via `ast.literal_eval(test)`.
- `test_info`: Contains the function name, parameter list, declaration, and docstring. If you are doing RL, we suggest including this information in the prompt.
- `gpt_pass_sequence`: Solution-test pairs are generated up to 10 times. A value of 1 indicates that the solution passed self-verification via unit tests on that trial, while 0 indicates failure.
- `gpt_pass_trial_num`: Number of trials that passed self-verification.
- `gpt_pass_percentage`: Percentage of passing trials relative to total trials.
- `gpt_difficulty`: Question difficulty level derived from `gpt_pass_percentage`.
- `trials`: Detailed information for each trial, including coverage statistics generated by `Pytest`.
- `metadata`: Contains seed information for internal debugging purposes.
- `benchmark_similarity`: Maximum cosine similarity between this question and all questions from HumanEval, MBPP, BigCodeBench, and LiveCodeBench.
- `filter_reason`: For questions labeled `use_with_caution`, explains why the question was filtered based on the pre-determined rules.

## 🧐 Other Information

**License**: Please follow [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en).

**Contact**: Please contact [Zhangchen](mailto:zxu9@uw.edu) by email.
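
To illustrate two of the fields described under Data Formats, here is a minimal sketch of parsing `test` with `ast.literal_eval` and recomputing the pass statistics from `gpt_pass_sequence`. The record is hypothetical and not taken from the dataset. The `stdin` / `stdout` keys are an assumption about how an Online Judge test dictionary might be laid out, and the pass rate is returned as a fraction in [0, 1], which may not match the scale stored in `gpt_pass_percentage`. The helper names are illustrative only.

```python
import ast


def summarize_pass_sequence(pass_sequence):
    """Return (number of passing trials, fraction of passing trials)."""
    if not pass_sequence:
        return 0, 0.0
    passed = sum(pass_sequence)
    return passed, passed / len(pass_sequence)


def parse_test_field(test):
    """Convert the string stored in `test` into a Python dictionary."""
    parsed = ast.literal_eval(test)
    if not isinstance(parsed, dict):
        raise ValueError("expected `test` to evaluate to a dict")
    return parsed


# Hypothetical record for illustration only. The 'stdin'/'stdout' keys
# are an assumed layout, not a documented schema.
record = {
    "test": "{'stdin': ['1 2\\n', '5 7\\n'], 'stdout': ['3\\n', '12\\n']}",
    "gpt_pass_sequence": [1, 0, 1, 1],
}

tests = parse_test_field(record["test"])
passed, fraction = summarize_pass_sequence(record["gpt_pass_sequence"])
print(passed, fraction)  # 3 0.75

# Pair each assumed input with its expected output.
for given, expected in zip(tests["stdin"], tests["stdout"]):
    print(repr(given), "->", repr(expected))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings loaded from a dataset. Inspect a real record before relying on any particular key names.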
## 📚 Citation

If you find the data or code useful, please cite:

```
@article{xu2025kodcode,
  title={KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding},
  author={Zhangchen Xu and Yang Liu and Yueqin Yin and Mingyuan Zhou and Radha Poovendran},
  year={2025},
  eprint={2503.02951},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2503.02951},
}
```


Provider: maas

Created: 2025-03-07
