OpenCodeGeneticInstruct
收藏魔搭社区2025-12-04 更新2025-06-21 收录
下载链接:
https://modelscope.cn/datasets/nv-community/OpenCodeGeneticInstruct
下载链接
链接失效反馈官方服务:
资源简介:
# OpenCodeGeneticInstruct: A large-scale dataset of coding instructions for improving the code generation capabilities of LLMs
## Data Overview
OpenCodeGeneticInstruct comprises more than 15M coding instructions in python which is generated synthetically with the Genetic-Instruct [1] approach.
This dataset can be used for supervised fine-tuning (SFT) of LLMs to improve their code genearation capability.
Each sample includes a coding question/instruction and its corrsponding answer. The answer contains the coding solution in Python.
- [Paper](https://arxiv.org/abs/2407.21077) - Discover the methodology and technical details behind OpenCodeGeneticInstruct.
- [Github Repo](https://github.com/NVIDIA/NeMo-Skills) - Access the complete pipeline used to perform the SFT experiments.
This dataset can be used for commercial/non-commercial use.
## Data distribution
- We use 512 samples from the Tiger-Leetcode [TigerResearch. 2023. Tigerbot kaggle leetcode solutions dataset (english) - 2k. https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k.] collection as the initial population to generate this dataset.
- Two genearation models, Mixtral-8x22B-Instruct and Qwen2.5-32B-Instruct, are used to generate the samples.
| Generation Model | # Seed | # Sample |
|:--------------------------|:-----------|:-----------|
| Mixtral-8x22B-Instruct | 512 | 7,593,134 |
| Qwen2.5-32B-Instruct | 512 | 7,000,000 |
| Total | 512 | 15,093,134 |
## Dataset Owner(s)
NVIDIA Corporation
## Dataset Creation Date
January 2025 - March 2025
## License/Terms of Use
GOVERNING TERMS: This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) available at https://creativecommons.org/licenses/by/4.0/legalcode.
**Data Developer:** NVIDIA
### Use Case: <br>
Developers training LLMs to specialize LLMs in code generation. <br>
### Release Date: <br>
05/06/2025 <br>
## Intended Usage
The OpenCodeInstruct Dataset is intended to be used by the community to continue to improve open models. The data may be freely used to train models. **However, for
each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose**.
## Dataset Characterization
** Data Collection Method<br>
* [Hybrid: Automated, Synthetic] <br>
** Labeling Method<be>
* [Hybrid: Automated, Synthetic] <br>
## Dataset Format
|Field|Type|Description|
|:---|:---|:---|
|id|string|A unique id for each question|
|input|string|The input competitive programming question |
|output|string|Model's response.|
|solution|string|Only the code portion of Model's response.|
|last_operation|string|Operation used to generate this question (mutation, crossover).|
|generation_model|string|Model used to generate this question/response (Mixtral-8x22B-Instruct, Qwen2.5-32B-Instruct).|
## How to use it
You can load the dataset with the following lines of code.
```python
from datasets import load_dataset
ds_mixtral = load_dataset("nvidia/OpenCodeGeneticInstruct",
name="mixtral-8x22b-instruct", split="train")
print(ds_mixtral)
## output:
## Dataset({
## features: ['id', 'input', 'solution', 'output', 'operation', 'model'],
## num_rows: 7593134
## })
ds_qwen = load_dataset("nvidia/OpenCodeGeneticInstruct",
name="qwen2.5-32b-instruct", split="train")
print(ds_qwen)
## output:
## Dataset({
## features: ['id', 'input', 'solution', 'output', 'operation', 'model'],
## num_rows: 7500000
##})
```
## Dataset Quantification
- Record Count - 15 million coding question-answer pairs.
- Download Size - 25 GB
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Citation
If you find the data useful, please cite:
```
[1] @article{majumdar2025geneticinstructscalingsynthetic,
title = {Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models},
author = {Somshubra Majumdar, Vahid Noroozi, Mehrzad Samadi, Sean Narenthiran, Aleksander Ficek, Wasi Uddin Ahmad, Jocelyn Huang, Jagadeesh Balam, Boris Ginsburg},
year={2025},
eprint={2407.21077},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.21077},
}
```
# OpenCodeGeneticInstruct:用于提升大语言模型(LLM)代码生成能力的大规模代码指令数据集
## 数据概览
OpenCodeGeneticInstruct 包含超过1500万条Python代码指令,均通过Genetic-Instruct[1]方法合成生成。本数据集可用于大语言模型的监督微调(SFT),以提升其代码生成能力。每个样本均包含一条代码问题/指令及其对应答案,答案中包含Python语言实现的代码解决方案。
- [论文](https://arxiv.org/abs/2407.21077) - 了解OpenCodeGeneticInstruct背后的方法论与技术细节。
- [GitHub仓库](https://github.com/NVIDIA/NeMo-Skills) - 获取用于监督微调实验的完整流程。
本数据集可用于商业与非商业用途。
## 数据分布
我们从Tiger-Leetcode数据集[TigerResearch. 2023. Tigerbot Kaggle LeetCode解决方案数据集(英文)-2k。https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k.]中选取512条样本作为初始种群,用于生成本数据集。我们使用Mixtral-8x22B-Instruct与Qwen2.5-32B-Instruct两款生成模型来生成样本。
| 生成模型 | 初始种子样本数 | 生成样本数 |
|:--------------------------|:-----------|:-----------|
| Mixtral-8x22B-Instruct | 512 | 7,593,134 |
| Qwen2.5-32B-Instruct | 512 | 7,000,000 |
| 总计 | 512 | 15,093,134 |
## 数据集所属方
NVIDIA公司
## 数据集创建时间
2025年1月 - 2025年3月
## 使用许可条款
本数据集采用知识共享署名4.0国际许可协议(CC BY 4.0)进行授权,详情请见https://creativecommons.org/licenses/by/4.0/legalcode。
**数据开发者:** NVIDIA
### 适用场景:<br>
用于训练大语言模型,使其专注于代码生成任务。<br>
### 发布日期:<br>
2025年5月6日<br>
## 预期用途
本数据集旨在供社区使用,以持续优化开源模型。用户可自由使用本数据集训练模型。**但用户需自行确认所选用数据集的许可协议是否符合其预期用途。**
## 数据集特征
** 数据采集方法<br>
* [混合模式:自动化、合成生成]<br>
** 标注方法<br>
* [混合模式:自动化、合成生成]<br>
## 数据集格式
|字段|类型|描述|
|:---|:---|:---|
|id|string|每条问题的唯一标识符|
|input|string|输入的竞赛编程问题|
|output|string|模型的回复内容|
|solution|string|仅包含模型回复中的代码部分|
|last_operation|string|用于生成该问题的操作(变异、交叉)|
|generation_model|string|用于生成该问题/回复的模型(Mixtral-8x22B-Instruct、Qwen2.5-32B-Instruct)|
## 使用方法
可通过以下代码加载本数据集:
python
from datasets import load_dataset
ds_mixtral = load_dataset("nvidia/OpenCodeGeneticInstruct",
name="mixtral-8x22b-instruct", split="train")
print(ds_mixtral)
## 输出:
## Dataset({
## features: ['id', 'input', 'solution', 'output', 'operation', 'model'],
## num_rows: 7593134
## })
ds_qwen = load_dataset("nvidia/OpenCodeGeneticInstruct",
name="qwen2.5-32b-instruct", split="train")
print(ds_qwen)
## 输出:
## Dataset({
## features: ['id', 'input', 'solution', 'output', 'operation', 'model'],
## num_rows: 7500000
##})
## 数据集量化统计
- 样本总数:1500万条代码问答对。
- 下载大小:25 GB
## 伦理考量
NVIDIA认为可信人工智能是一项共同责任,我们已制定相关政策与实践规范,以支持各类人工智能应用的开发。开发者在按照本服务条款下载或使用本数据集时,应与其内部模型团队协作,确保所开发的模型符合相关行业与应用场景的要求,并防范可能出现的产品误用问题。
请通过[此链接](https://www.nvidia.com/en-us/support/submit-security-vulnerability/)提交安全漏洞或NVIDIA人工智能相关问题。
## 引用信息
若本数据集对您的研究有所帮助,请引用以下文献:
[1] @article{majumdar2025geneticinstructscalingsynthetic,
title = {Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models},
author = {Somshubra Majumdar, Vahid Noroozi, Mehrzad Samadi, Sean Narenthiran, Aleksander Ficek, Wasi Uddin Ahmad, Jocelyn Huang, Jagadeesh Balam, Boris Ginsburg},
year={2025},
eprint={2407.21077},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.21077},
}
提供机构:
maas
创建时间:
2025-06-14



