OpenCodeInstruct
Source: ModelScope community. Updated 2026-05-16. Indexed 2025-05-03.
Download link:
https://modelscope.cn/datasets/nv-community/OpenCodeInstruct
Resource description:
# OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs
## Dataset Description
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. OpenCodeInstruct is designed for supervised fine-tuning (SFT).
- [Technical Report](https://arxiv.org/abs/2504.04030) - Discover the methodology and technical details behind OpenCodeInstruct.
- [GitHub Repo](https://github.com/NVIDIA/NeMo-Skills) - Access the complete pipeline used to perform SFT.
This dataset is ready for commercial/non-commercial use.
## Dataset Owner(s)
NVIDIA Corporation
## Dataset Creation Date
January 2025 - March 2025
## License/Terms of Use
GOVERNING TERMS: This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) available at https://creativecommons.org/licenses/by/4.0/legalcode.
**Data Developer:** NVIDIA
### Use Case:
Developers training LLMs to specialize in code generation.
### Release Date:
April 28, 2025
## Intended Usage
The OpenCodeInstruct Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train models. **However, for each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose**.
## Dataset Characterization
**Data Collection Method**
* [Hybrid: Automated, Synthetic]

**Labeling Method**
* [Hybrid: Automated, Synthetic]
## Dataset Format
|Field|Type|Description|
|:---|:---|:---|
|id|string|A unique ID for each question.|
|input|string|The input coding question.|
|output|string|The LLM's response.|
|domain|string|Either "generic" or "algorithmic".|
|generation_algorithm|string|Either "self-instruct" or "evol-instruct".|
|llm_judgement|string|String representation of a JSON dictionary containing the LLM's evaluation of the response based on several criteria.|
|unit_tests|string|String representation of a list of assertion statements.|
|tests_execution_status|string|String representation of a list of strings, each either "pass" or "fail".|
|average_test_score|float|Fraction of test cases passed.|
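
Because `llm_judgement`, `unit_tests`, and `tests_execution_status` are stored as strings, they need to be decoded before use. The sketch below shows one way to do that. It assumes the strings are either valid JSON or Python literal syntax (the table does not say which), so it tries both. The record itself, including the key names inside `llm_judgement`, is invented for illustration and is not taken from the dataset.

```python
import ast
import json


def parse_encoded(value):
    """Decode a string-encoded list or dict: try JSON first, then Python literal syntax."""
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        # Assumption: non-JSON values use Python literal syntax (e.g. single quotes).
        return ast.literal_eval(value)


# Hypothetical record shaped like the table above; values are made up.
record = {
    "id": "example-0",
    "input": "Write a function add(a, b) that returns the sum of a and b.",
    "output": "def add(a, b):\n    return a + b",
    "domain": "generic",
    "generation_algorithm": "self-instruct",
    "llm_judgement": '{"requirement_conformance": {"score": 5, "justification": "Meets the spec."}}',
    "unit_tests": '["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]',
    "tests_execution_status": "['pass', 'pass']",
    "average_test_score": 1.0,
}

judgement = parse_encoded(record["llm_judgement"])
unit_tests = parse_encoded(record["unit_tests"])
statuses = parse_encoded(record["tests_execution_status"])

# Recompute the fraction of passing tests from the per-test statuses.
pass_fraction = sum(status == "pass" for status in statuses) / len(statuses)
```

Recomputing `pass_fraction` this way gives a quick consistency check against the stored `average_test_score` field.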
## How to Use It
You can load the dataset with the following two lines of code.
```python
from datasets import load_dataset
opencodeinstruct = load_dataset("nvidia/OpenCodeInstruct", split="train")
```
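
Once loaded, a common next step is to keep only samples that pass enough of their unit tests. The following is a minimal sketch of such a filter, not part of the official pipeline: the function name `keep_sample`, the 0.8 threshold, and the three sample rows are all hypothetical, and only the `average_test_score` and `domain` fields come from the format table.

```python
def keep_sample(example, min_score=1.0, domain=None):
    """Return True if the sample meets the score threshold and, if given, the domain."""
    if example["average_test_score"] < min_score:
        return False
    return domain is None or example["domain"] == domain


# Hypothetical rows standing in for dataset records.
samples = [
    {"id": "a", "domain": "generic", "average_test_score": 1.0},
    {"id": "b", "domain": "algorithmic", "average_test_score": 0.9},
    {"id": "c", "domain": "algorithmic", "average_test_score": 0.4},
]

# Keep algorithmic samples that pass at least 80% of their tests.
kept_ids = [s["id"] for s in samples if keep_sample(s, min_score=0.8, domain="algorithmic")]
```

The same predicate can be passed to the `filter` method of the loaded dataset, for example `opencodeinstruct.filter(keep_sample)`, which with the default arguments keeps only samples whose tests all pass.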
## Dataset Quantification
- Record Count - 5 million coding question-answer pairs.
- Download Size - 6.4 GB
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Citation
If you find the data useful, please cite:
```bibtex
@article{ahmad2025opencodeinstruct,
title={OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs},
author={Wasi Uddin Ahmad and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Vahid Noroozi and Somshubra Majumdar and Boris Ginsburg},
year={2025},
eprint={2504.04030},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.04030},
}
```
Provider: maas
Created: 2025-04-29



