Open-Critic-GPT
Source: ModelScope community (updated 2026-01-06, indexed 2024-08-31)
Download link: https://modelscope.cn/datasets/AI-ModelScope/Open-Critic-GPT
<img src="https://huggingface.co/Vezora/Agent-7b-v1/resolve/main/Designer.png" width="200" height="200" />
# Open-Critic-GPT Dataset
## Overview
**Creator** [Nicolas Mejia-Petit](https://twitter.com/mejia_petit)
[My Kofi](https://ko-fi.com/nicolasmejiapetit)
The Open-Critic-GPT dataset is a synthetic dataset created to train models to both identify and fix bugs in code. The dataset is generated using a unique synthetic data pipeline that involves:
1. Prompting a local model with an existing code example.
2. Introducing bugs into the code, while also having the model, from a first-person perspective, find the bugs and explain them.
3. Manipulating the data by shifting around where the broken code and working code are, and removing the `# bug//` and `# error//` comments from the code.
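
The comment-removal part of step 3 could look roughly like the sketch below. It assumes the markers are trailing line comments that begin with `# bug//` or `# error//` and run to the end of the line; the exact format used by the real pipeline is not documented here, and the function name is hypothetical.

```python
import re

# Assumption: a marker is a comment starting with "# bug//" or "# error//"
# that continues to the end of the line. The real pipeline may differ.
MARKER = re.compile(r"[ \t]*#\s*(?:bug|error)//.*$", re.IGNORECASE | re.MULTILINE)


def strip_marker_comments(code: str) -> str:
    """Remove bug/error marker comments while leaving the code itself intact."""
    return MARKER.sub("", code)


example = "def add(a, b):\n    return a - b  # bug// wrong operator\n"
cleaned = strip_marker_comments(example)
print(cleaned)
```

A plain regex like this would also rewrite a string literal that happens to contain `# bug//`, so a production pipeline would likely need a language-aware tokenizer.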
This process allows the creation of two distinct datasets within Open-Critic-GPT:
- **Code-Preference-Pairs Dataset**: (SFT) Contains pairs of otherwise identical code examples. The only difference is that the rejected example has the bugged code 'surgically transplanted in', while the accepted example is left unchanged.
- **Open-Critic-GPT Dataset**: (DPO) Trains the model to find bugs and produce working code from broken code.
- Both datasets span a total of 127 different languages/structures. (Some may have been lost in conversion: the pipeline started with 122k examples and ended with 55k due to a lack of structured output. A fine-tuned model may produce better-structured outputs.)
- Both datasets contain ~55K examples each (both derived from the same parent examples).
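
As an illustration of the 'surgical transplant' idea, here is a minimal sketch that builds one pair whose two sides differ only by the injected bug. The function name and the `prompt`/`chosen`/`rejected` keys are assumptions for this example, not the dataset's documented schema.

```python
def make_preference_pair(prompt: str, working_code: str, old: str, new: str) -> dict:
    """Build one pair whose two sides differ only by the transplanted bug."""
    # Require a single, unambiguous transplant site so the pair stays minimal.
    if working_code.count(old) != 1:
        raise ValueError("expected exactly one occurrence of the snippet to replace")
    return {
        "prompt": prompt,
        "chosen": working_code,                      # accepted: left unchanged
        "rejected": working_code.replace(old, new),  # rejected: bug transplanted in
    }


pair = make_preference_pair(
    prompt="Write a function that returns the larger of two numbers.",
    working_code="def larger(a, b):\n    return a if a > b else b\n",
    old="a > b",
    new="a < b",
)
```

Keeping everything except the bug identical means the only signal separating the two sides is the bug itself.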
## Dataset Structure
The dataset is organized as follows:
- **Code Examples**: In each code example, the model is given a snippet of bugged code and asked to find the bugs and fix them:
  - **Bugged Code**: The version of the code with introduced bugs and no comments, to prevent the model from learning from comments that say "Bug" or "Error".
  - **Explanation**: An explanation is provided for each bugged code example, detailing the nature of the bug, what the bug does to the code, and tips to avoid it.
  - **Fixed Code**: Lastly, the model writes the fully working code, with bugs fixed and comments added to the code.
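
One way to picture a single record is as three parts that get folded into a prompt and a response. The class, field names, and prompt wording below are hypothetical and chosen only for illustration; check the actual column names in the dataset before relying on them.

```python
from dataclasses import dataclass


@dataclass
class CriticExample:
    bugged_code: str   # input shown to the model, marker comments removed
    explanation: str   # nature of the bug, its effect, and tips to avoid it
    fixed_code: str    # fully working code with comments added

    def to_prompt_and_response(self) -> tuple[str, str]:
        """Fold the three parts into one (prompt, response) training pair."""
        prompt = f"Find and fix the bugs in this code:\n\n{self.bugged_code}"
        response = f"{self.explanation}\n\nFixed code:\n\n{self.fixed_code}"
        return prompt, response


record = CriticExample(
    bugged_code="def add(a, b):\n    return a - b\n",
    explanation="The function subtracts instead of adding.",
    fixed_code="def add(a, b):\n    # Return the sum of both arguments.\n    return a + b\n",
)
prompt, response = record.to_prompt_and_response()
```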
## Usage
- Just give me credit :)
- Oh, and current employees of 'Open AI' and/or the company as a whole are NOT permitted to use this dataset, or any derivative work that may come from it, for training. This is stated in the custom Apache license.
- For everyone else, it falls under Apache 2.0 :).
### Training Models
When training models with the Open-Critic-GPT dataset, it is essential to use a data collator to ensure that the loss is not calculated on the bugged code. The data collator manages the dataset during training to provide the model with the correct inputs and outputs for loss calculation.
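
A minimal sketch of such a collator is shown below, in plain Python so the masking logic is easy to follow. It assumes each example is already tokenized into prompt ids (the bugged code) and response ids (the explanation plus fixed code). The function names and `pad_token_id=0` are illustrative assumptions, and this is not necessarily the collator the author used. The value `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions labeled with it contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Mask the prompt (bugged code) so only the response is scored."""
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [IGNORE_INDEX] * len(prompt_ids) + list(response_ids),
    }


def collate(batch: list[dict], pad_token_id: int = 0) -> dict:
    """Right-pad a batch to equal length; padding is also excluded from the loss."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    out = {"input_ids": [], "labels": [], "attention_mask": []}
    for ex in batch:
        pad = max_len - len(ex["input_ids"])
        out["input_ids"].append(ex["input_ids"] + [pad_token_id] * pad)
        out["labels"].append(ex["labels"] + [IGNORE_INDEX] * pad)
        out["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * pad)
    return out


# Toy token ids stand in for real tokenizer output.
batch = collate([build_labels([5, 6, 7], [8, 9]), build_labels([1], [2])])
```

In a real training loop the lists would be converted to tensors before being passed to the model.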
### Crediting dataset creators:
- This dataset was created using 'm-a-p/CodeFeedback-Filtered-Instruction', which contains data from several different sources.
- Here are the original authors of the original sources. Thank you to the following authors: Nick Roshdieh for evol Instruct, Ajinkya Bawase for Python shareGPT 23k, Intelligent Software Engineering for Magicoder, and Multimodal Art Projection for the compiled and filtered m-a-p/CodeFeedback-Filtered-Instruction.
### Begging for money section.
- I created this dataset off a single 3090. Imagine what I could do with two.
- I can't continue to work on these open source projects without receiving a sponsorship or some form of compensation. All the money I make from this will go directly back into helping the open source community.
- If you can, it would mean the world to me; any donation helps me release this work for free. Thank you :)
- [Kofi](https://ko-fi.com/nicolasmejiapetit)
Provider: maas
Created: 2024-08-02



