
OpenCodeReasoning

ModelScope community (魔搭社区); updated 2026-05-16; indexed 2025-04-12
Download link: https://modelscope.cn/datasets/AI-ModelScope/OpenCodeReasoning
# OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

## Data Overview

OpenCodeReasoning is the largest reasoning-based synthetic dataset to date for coding, comprising 735,255 samples in Python across 28,319 unique competitive programming questions. OpenCodeReasoning is designed for supervised fine-tuning (SFT).

- [Technical Report](https://arxiv.org/abs/2504.01943) - Discover the methodology and technical details behind OpenCodeReasoning.
- [Github Repo](https://github.com/NVIDIA/NeMo-Skills) - Access the complete pipeline used to perform SFT.

This dataset is ready for commercial/non-commercial use.

## Data distribution

- The CodeForces problems are sourced from http://codeforces.com.
- The question collections are gathered from TACO (https://huggingface.co/datasets/BAAI/TACO), APPS (https://huggingface.co/datasets/codeparrot/apps), CodeContests (https://huggingface.co/datasets/deepmind/code_contests), and open-r1/codeforces (https://huggingface.co/datasets/open-r1/codeforces).
- We do not include the test split of CodeContests and open-r1/codeforces.
- The output responses are generated by R1.

| Source         | # Questions | # Samples  |
|:---------------|:------------|:-----------|
| AIZU           | 2,123       | 62,476     |
| AtCoder        | 2,043       | 47,222     |
| CodeChef       | 3,796       | 72,925     |
| CodeForces     | 10,069      | 386,948    |
| Codewars       | 2,493       | 34,326     |
| GeeksForGeeks  | 2,667       | 37,602     |
| HackerEarth    | 2,269       | 59,181     |
| HackerRank     | 895         | 10,955     |
| Kattis         | 1,187       | 13,095     |
| LeetCode       | 777         | 10,525     |
| Total          | 28,319      | 735,255    |

## Data Fields

|Field|Type|Description|
|:---|:---|:---|
|id|string|A unique id for each question|
|input|string|The input competitive programming question (split_0 only). For split_1, users need to retrieve the question using the dataset/split/index fields.|
|output|string|R1's response.|
|solution|string|Only the code portion of R1's response.|
|dataset|string|The name of the dataset from which this question is collected (e.g., "apps", "taco", "code_contests")|
|license|string|The license associated with the dataset (e.g., "mit", "apache-2.0", "cc-by-4.0")|
|split|string|The name of the split of the dataset from which this question is collected (e.g., "train", "valid", "test")|
|source|string|The name of the competitive programming platform (e.g., CodeForces, CodeChef)|
|difficulty|string|A difficulty label for the input question.|
|index|string|An index to retrieve the input question from the APPS/TACO dataset (only available for split_1).|

## How to use it

You can load the dataset with the following lines of code.

```python
from datasets import load_dataset
from tqdm import tqdm

ocr_ds_split_0 = load_dataset("nvidia/OpenCodeReasoning", "split_0")
print(ocr_ds_split_0)
# DatasetDict({
#     split_0: Dataset({
#         features: ['id', 'input', 'output', 'source', 'license', 'dataset', 'split', 'difficulty', 'solution'],
#         num_rows: 567850
#     })
# })

ocr_ds_split_1 = load_dataset("nvidia/OpenCodeReasoning", "split_1")
print(ocr_ds_split_1)
# DatasetDict({
#     split_1: Dataset({
#         features: ['id', 'index', 'input', 'output', 'source', 'license', 'dataset', 'split', 'difficulty', 'solution'],
#         num_rows: 167405
#     })
# })

# split_1 rows store "-" as the input; the question text lives in TACO/APPS.
datasets = {
    "taco": load_dataset("BAAI/TACO"),
    "apps": load_dataset("codeparrot/apps")
}
# The DatasetDict printed above exposes the rows under the "split_1" key.
for item in tqdm(ocr_ds_split_1["split_1"]):
    assert item["input"] == "-"
    assert item["dataset"] in ["taco", "apps"]
    # This updates only the local `item` dict, not the stored dataset.
    item["input"] = datasets[item["dataset"]][item["split"]][int(item["index"])]["question"]
```

## Dataset Characterization

**Data Collection Method**
- Hybrid: Automated, Synthetic

**Labeling Method**
- Hybrid: Automated, Synthetic

## License/Terms of Use

This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) available at https://creativecommons.org/licenses/by/4.0/legalcode.

**Data Developer:** NVIDIA

### Use Case:

Developers training LLMs to distill reasoning capabilities for code generation.

### Release Date:

04/04/2025

## Data Version

1.0 (04/04/2025)

## Intended use

The OpenCodeReasoning Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train models. **However, for each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose**.

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

## Citation

If you find the data useful, please cite:

```
@article{ahmad2025opencodereasoning,
  title={OpenCodeReasoning: Advancing Data Distillation for Competitive Coding},
  author={Wasi Uddin Ahmad and Sean Narenthiran and Somshubra Majumdar and Aleksander Ficek and Siddhartha Jain and Jocelyn Huang and Vahid Noroozi and Boris Ginsburg},
  year={2025},
  eprint={2504.01943},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.01943},
}
```
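
The split_1 lookup described under "How to use it" can also be written as a small pure function, which makes the dataset/split/index indirection easier to test without downloading anything. This is a minimal sketch, not part of the official pipeline: the helper name `fill_missing_input` and the toy in-memory data are assumptions made for illustration, and only the field names (`input`, `dataset`, `split`, `index`, `question`) come from the card itself.

```python
from typing import Any, Mapping


def fill_missing_input(item: dict, sources: Mapping[str, Any]) -> dict:
    """Return a copy of `item` whose "input" holds the original question text.

    Hypothetical helper: `sources` maps a dataset name ("taco", "apps") to an
    object indexable as sources[name][split][row_index]["question"].
    """
    if item["input"] != "-":
        # Rows that already carry the question (as in split_0) are left as-is.
        return dict(item)
    source_split = sources[item["dataset"]][item["split"]]
    row = source_split[int(item["index"])]  # "index" is stored as a string
    return {**item, "input": row["question"]}


# Tiny in-memory stand-in for the real TACO/APPS datasets (made-up data).
toy_sources = {"apps": {"train": [{"question": "Add two integers."}]}}
toy_item = {"input": "-", "dataset": "apps", "split": "train", "index": "0"}

restored = fill_missing_input(toy_item, toy_sources)
print(restored["input"])  # prints "Add two integers."
```

Because the helper returns a new dict instead of mutating its argument, it should be usable with `Dataset.map` from the `datasets` library on the split_1 rows, passing the loaded TACO and APPS datasets as `sources`, so that the filled-in questions are kept in the resulting dataset.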

Provider: maas
Created: 2025-04-08