OpenCodeReasoning-2
ModelScope community. Updated 2026-05-15; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/nv-community/OpenCodeReasoning-2
# OpenCodeReasoning-2: A Large-scale Dataset for Reasoning in Code Generation and Critique
## Dataset Description
OpenCodeReasoning-2 is the largest reasoning-based synthetic dataset to date for coding, comprising 1.4M samples in Python and 1.1M samples in C++ across 34,799 unique competitive programming questions.
OpenCodeReasoning-2 is designed for supervised fine-tuning (SFT) tasks of code completion and code critique.
- [Github Repo](https://github.com/NVIDIA/NeMo-Skills) - Access the complete pipeline used to perform SFT.
This dataset is ready for commercial/non-commercial use.
## Data distribution
- The CodeForces problems are sourced from http://codeforces.com.
- The question collections are gathered from TACO (https://huggingface.co/datasets/BAAI/TACO), APPS (https://huggingface.co/datasets/codeparrot/apps), CodeContests (https://huggingface.co/datasets/deepmind/code_contests), and open-r1/codeforces (https://huggingface.co/datasets/open-r1/codeforces).
- We do not include the test split of CodeContests and open-r1/codeforces.
- The solution responses are generated by R1 and critique responses are generated by QwQ.
### Python
| Source | # Questions | # Samples |
|:---------------|:-----------|:-------------|
| AIZU | 2150 | 71,681 |
| AtCoder | 2080 | 64,468 |
| CodeChef | 3869 | 120,040 |
| CodeForces | 15641 | 834,523 |
| Codewars | 2506 | 79,771 |
| GeeksForGeeks | 2670 | 58,154 |
| HackerEarth | 2285 | 73,559 |
| HackerRank | 912 | 26,106 |
| Kattis | 1235 | 39,938 |
| LeetCode | 777 | 29,926 |
| Total | 34,125 | 1,398,166 |
### C++
| Source | # Questions | # Samples |
|:---------------|:-----------|:-------------|
| AIZU | 2067 | 35,471 |
| AtCoder | 1988 | 62,493 |
| CodeChef | 3830 | 171,882 |
| CodeForces | 11887 | 355,180 |
| Codewars | 2492 | 155,162 |
| GeeksForGeeks | 2668 | 167,610 |
| HackerEarth | 2273 | 82,765 |
| HackerRank | 903 | 43,867 |
| Kattis | 1209 | 49,699 |
| LeetCode | 775 | 50,346 |
| Total | 30,092 | 1,174,475 |
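As a quick sanity check, the per-source counts in the two tables above can be summed and compared with the stated totals. The sketch below hard-codes the table values; the names `PYTHON_COUNTS`, `CPP_COUNTS`, and `totals` are illustrative and not part of the dataset.

```python
# (questions, samples) per source, copied from the tables above.
PYTHON_COUNTS = {
    "AIZU": (2150, 71681),
    "AtCoder": (2080, 64468),
    "CodeChef": (3869, 120040),
    "CodeForces": (15641, 834523),
    "Codewars": (2506, 79771),
    "GeeksForGeeks": (2670, 58154),
    "HackerEarth": (2285, 73559),
    "HackerRank": (912, 26106),
    "Kattis": (1235, 39938),
    "LeetCode": (777, 29926),
}
CPP_COUNTS = {
    "AIZU": (2067, 35471),
    "AtCoder": (1988, 62493),
    "CodeChef": (3830, 171882),
    "CodeForces": (11887, 355180),
    "Codewars": (2492, 155162),
    "GeeksForGeeks": (2668, 167610),
    "HackerEarth": (2273, 82765),
    "HackerRank": (903, 43867),
    "Kattis": (1209, 49699),
    "LeetCode": (775, 50346),
}


def totals(counts):
    """Return (total_questions, total_samples) for one language table."""
    questions = sum(q for q, _ in counts.values())
    samples = sum(s for _, s in counts.values())
    return questions, samples


for name, counts in [("Python", PYTHON_COUNTS), ("C++", CPP_COUNTS)]:
    questions, samples = totals(counts)
    print(f"{name}: {questions:,} questions, {samples:,} samples, "
          f"{samples / questions:.1f} samples per question on average")
```

The summed values match the `Total` rows of both tables.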
## Data Fields
|Field|Type|Description|
|:---|:---|:---|
|id|string|A unique id for each data instance|
|question_id|string|A unique id for each question|
|question|string|The competitive programming question. It is stored as a placeholder ("-"); see the "How to use it" section to retrieve the full text.|
|r1_generation|string|R1's response.|
|qwq_critique|string|QwQ's response.|
|solution|string|Only the code portion of R1's response.|
|judgement|string|Only the judgement (right/wrong) from QwQ's response.|
|pass_rate|float|Value in range [0, 1] or -1 (not enough unit tests or execution system couldn't validate).|
|dataset|string|The name of the dataset from which this question is collected (e.g., "apps", "taco", "code_contests")|
|license|string|The license associated with the dataset (e.g., "mit", "apache-2.0", "cc-by-4.0")|
|split|string|The name of the dataset split from which this question is collected (e.g., "train", "valid", "test")|
|source|string|The name of the competitive programming platform (e.g., CodeForces, CodeChef)|
|difficulty|string|A difficulty label for the input question.|
|index|string|An index to retrieve the input question from the APPS/TACO datasets (only available for the train-extra split).|
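
The `pass_rate` and `judgement` fields can be combined to select samples for SFT. The sketch below is one possible filter, not the authors' procedure. The helper name `keep_for_sft`, the 0.5 threshold, and the assumption that `judgement` holds the literal strings "right" or "wrong" are all illustrative.

```python
def keep_for_sft(sample, min_pass_rate=0.5, require_right=False):
    """Decide whether a record is kept, based on pass_rate and judgement.

    pass_rate is in [0, 1], or -1 when the solution could not be validated.
    The judgement values "right"/"wrong" are an assumption of this sketch.
    """
    pass_rate = sample["pass_rate"]
    if pass_rate < 0:  # -1: not enough unit tests, or execution failed
        return False
    if pass_rate < min_pass_rate:
        return False
    if require_right and sample["judgement"] != "right":
        return False
    return True


# Hypothetical records containing only the fields the filter reads.
records = [
    {"id": "a", "pass_rate": 1.0, "judgement": "right"},
    {"id": "b", "pass_rate": -1, "judgement": "right"},
    {"id": "c", "pass_rate": 0.25, "judgement": "wrong"},
    {"id": "d", "pass_rate": 0.75, "judgement": "wrong"},
]
kept = [r["id"] for r in records if keep_for_sft(r)]
strict = [r["id"] for r in records if keep_for_sft(r, require_right=True)]
print(kept, strict)
```

Treating `-1` as "unknown" rather than "failed" is a design choice; a different pipeline might keep those records and rely on `judgement` alone.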
## How to use it
```
from tqdm import tqdm
from datasets import load_dataset

# Source datasets that hold the original problem statements.
hf_datasets = {
    "taco": load_dataset("BAAI/TACO", trust_remote_code=True),
    "apps": load_dataset("codeparrot/apps", trust_remote_code=True),
    "code_contests": load_dataset("deepmind/code_contests"),
    "open-r1/codeforces": load_dataset("open-r1/codeforces"),
}


def get_question(ds_name, split, index):
    """Return the problem statement for one record, or None if unavailable."""
    benchmark = hf_datasets[ds_name][split][int(index)]
    if ds_name == "code_contests":
        if not benchmark["description"]:
            return None
        return benchmark["description"]
    elif ds_name in ["taco", "apps"]:
        return benchmark["question"]
    elif ds_name == "open-r1/codeforces":
        if not benchmark["description"]:
            return None
        # Rebuild the full statement from its separate parts.
        question = benchmark["description"]
        if benchmark["input_format"]:
            question += "\n\nInput\n\n" + benchmark["input_format"]
        if benchmark["output_format"]:
            question += "\n\nOutput\n\n" + benchmark["output_format"]
        if benchmark["examples"]:
            question += "\n\nExamples"
            for example in benchmark["examples"]:
                if "input" in example:
                    question += "\n\nInput\n\n" + example["input"]
                if "output" in example:
                    question += "\n\nOutput\n\n" + example["output"]
        if benchmark["note"]:
            question += "\n\nNote\n\n" + benchmark["note"]
        return question
    return None


ocr2_dataset = load_dataset("nvidia/OpenCodeReasoning-2")
for ocr2_ds in [ocr2_dataset["python"], ocr2_dataset["cpp"]]:
    for ocr2_ds_item in tqdm(ocr2_ds):
        assert ocr2_ds_item["dataset"] in ["taco", "apps", "code_contests", "open-r1/codeforces"]
        ds_name, ds_split, ds_index = ocr2_ds_item["dataset"], ocr2_ds_item["split"], int(ocr2_ds_item["index"])
        question = get_question(ds_name, ds_split, ds_index)
        assert question is not None
        assert ocr2_ds_item["question"] == "-"  # placeholder stored in the dataset
        ocr2_ds_item["question"] = question  # updates this in-memory dict only
```
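
Assigning to `ocr2_ds_item["question"]` inside the loop changes only the dictionary yielded for that iteration. One way to keep the restored text is to return it from a function passed to `Dataset.map`. The sketch below assumes a lookup callable with the same signature as `get_question`; `fill_question` and `fake_lookup` are hypothetical names used for illustration.

```python
def fill_question(item, lookup):
    """Return the field to update: the question text restored via lookup.

    lookup(ds_name, split, index) returns a string or None,
    for example the get_question function defined above.
    """
    question = lookup(item["dataset"], item["split"], int(item["index"]))
    if question is None:
        raise ValueError(f"no question found for id {item['id']}")
    return {"question": question}


# Stand-in lookup so the sketch runs without downloading anything.
def fake_lookup(ds_name, split, index):
    return f"[{ds_name}/{split}/{index}] problem statement"


item = {"id": "x1", "dataset": "taco", "split": "train", "index": "7", "question": "-"}
item.update(fill_question(item, fake_lookup))
print(item["question"])
```

With the real datasets loaded, `ocr2_dataset["python"].map(lambda item: fill_question(item, get_question))` would return a new dataset whose `question` column holds the restored text.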
## Dataset Owner(s)
NVIDIA Corporation
## Dataset Creation Date
March 2025 - May 2025
## License/Terms of Use
GOVERNING TERMS: This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0), available at https://creativecommons.org/licenses/by/4.0/legalcode.
NOTICE FOR SCRIPTS: You may run the scripts below to pull datasets from their original source. The underlying datasets are available from the original sources subject to their own license terms.
**Data Developer:** NVIDIA
### Use Case
Developers training large language models (LLMs) to specialize in code generation and code critique.
### Release Date
05/15/2025
## Data Version
1.0 (05/15/2025)
## Dataset Characterization
**Data Collection Method**
* [Hybrid: Automated, Synthetic]

**Labeling Method**
* [Hybrid: Automated, Synthetic]
## Intended Usage
The OpenCodeReasoning-2 Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train models. **However, for each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose**.
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Provider: maas
Created: 2025-05-17



