m-a-p/CodeFeedback-Filtered-Instruction
Hugging Face · Updated 2024-02-26 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/m-a-p/CodeFeedback-Filtered-Instruction
Resource description:
---
language:
- en
pipeline_tag: text-generation
tags:
- code
license: apache-2.0
task_categories:
- question-answering
size_categories:
- 100K<n<1M
---
<h1 align="center"> OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement</h1>
<p align="center">
<img width="1000px" alt="OpenCodeInterpreter" src="https://opencodeinterpreter.github.io/static/images/figure1.png">
</p>
<p align="center">
<a href="https://opencodeinterpreter.github.io/">[🏠Homepage]</a>
|
<a href="https://github.com/OpenCodeInterpreter/OpenCodeInterpreter/">[🛠️Code]</a>
</p>
<hr>
## OpenCodeInterpreter
OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities.
For further information and related work, refer to our paper on arXiv: ["OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement"](https://arxiv.org/abs/2402.14658).
## Dataset Description
CodeFeedback-Filtered-Instruction is a curated collection of code instruction queries extracted from four prominent open-source code instruction tuning datasets: [Magicoder-OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K), [Python code subset of ShareGPT](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT), [Magicoder-Evol-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-Evol-Instruct-110K), and [Evol-Instruct-Code](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1).
Initially, 287k queries were aggregated from these datasets. To isolate the most complex and informative instructions, the queries were filtered with Qwen-72B-Chat, an open-source chat model.
The model evaluated each code query together with its corresponding response in the compiled datasets and assigned a complexity score from 1 to 5; only queries rated 4 or 5 were retained for the seed set.
This filtering yielded a final collection of 156k high-quality single-turn code instructions.
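The filtering rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `query` and `answer`, the helper `filter_by_complexity`, and the word-count heuristic `toy_score` are all hypothetical. In the actual pipeline the score comes from Qwen-72B-Chat, whose prompt is not reproduced here.

```python
from typing import Callable, Iterable


def filter_by_complexity(
    pairs: Iterable[dict],
    score_fn: Callable[[str, str], int],
    min_score: int = 4,
) -> list[dict]:
    """Keep query/response pairs whose complexity score is at least min_score."""
    kept = []
    for pair in pairs:
        # The judge sees the query together with its response.
        score = score_fn(pair["query"], pair["answer"])
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 scale")
        if score >= min_score:
            kept.append({**pair, "complexity": score})
    return kept


def toy_score(query: str, answer: str) -> int:
    """Hypothetical stand-in for the LLM judge: longer pairs score higher."""
    words = len(query.split()) + len(answer.split())
    return max(1, min(5, 1 + words // 5))


# Hypothetical sample pairs; the field names are assumptions.
samples = [
    {"query": "Print hello.", "answer": "print('hello')"},
    {
        "query": "Implement an LRU cache with O(1) get and put operations "
                 "and explain the eviction policy in detail.",
        "answer": "Use an OrderedDict keyed by item; move entries to the end "
                  "on access and pop the oldest when full.",
    },
]

kept = filter_by_complexity(samples, toy_score)
print(len(kept), kept[0]["complexity"])
```

Passing the scorer in as a function keeps the retention rule (scores of 4 or 5) separate from how the score is produced, so the heuristic could be replaced by a real model call without touching the filter.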
In the subsequent processing steps described in the paper, we used only the queries and ignored the responses, with Single-turn Packing as the sole exception. Here, however, we retain all responses so that the dataset is more convenient to use.
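As a rough sketch of the two usage modes mentioned above (queries only, or queries with their retained responses), the snippet below works on a couple of hypothetical rows. The column names `query` and `answer` and the `train` split are assumptions, not something this card states, so check the dataset viewer before relying on them. `load_rows` uses the Hugging Face `datasets` library and is defined but not called, so the example runs offline.

```python
def load_rows():
    """Load the dataset from the Hugging Face Hub.

    Requires `pip install datasets` and network access; the split name
    "train" is an assumption.
    """
    from datasets import load_dataset  # imported lazily so the rest runs offline

    return load_dataset("m-a-p/CodeFeedback-Filtered-Instruction", split="train")


def queries_only(rows):
    """Drop the responses, keeping only the instruction text of each row."""
    return [row["query"] for row in rows]


def as_pairs(rows):
    """Keep each instruction together with its retained response."""
    return [(row["query"], row["answer"]) for row in rows]


# Hypothetical rows standing in for real records.
rows = [
    {"query": "Write a function that merges two sorted lists.",
     "answer": "def merge(a, b): ..."},
    {"query": "Explain what a SQL window function does.",
     "answer": "A window function computes ..."},
]

instructions = queries_only(rows)
pairs = as_pairs(rows)
print(instructions)
```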
## Contact
If you have any inquiries, please feel free to raise an issue or reach out to us via email at: xiangyue.work@gmail.com, zhengtianyu0428@gmail.com.
We're here to assist you!
⚠️ Part of this dataset was generated by OpenAI's language models. Please review OpenAI's usage policies before adopting it: https://openai.com/policies/usage-policies.
Provider:
m-a-p
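Summary of the original information
Dataset overview
Dataset name
- CodeFeedback-Filtered-Instruction
Dataset sources
- The dataset consists of code instruction queries extracted from the following four open-source code instruction tuning datasets:
  - Magicoder-OSS-Instruct
  - Python code subset of ShareGPT
  - Magicoder-Evol-Instruct
  - Evol-Instruct-Code
Dataset size
- The initial collection contains 287k queries.
- After filtering, the final collection contains 156k high-quality single-turn code instructions.
Filtering process
- Filtering is performed with the Qwen-72B-Chat model.
- The large language model (LLM) evaluates each code query together with its response and assigns a complexity score from 1 to 5; only queries scored 4 or 5 are kept.
Dataset characteristics
- The dataset retains all responses to make it more convenient to use.
Usage notes
- Part of the dataset was generated by OpenAI's language models; use of it must comply with OpenAI's usage policies.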
Collected summary
Dataset introduction

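Background and challenges
Background overview
CodeFeedback-Filtered-Instruction is a high-quality code instruction dataset containing about 156k single-turn code queries, filtered from several open-source code instruction datasets. Qwen-72B-Chat scored the entries, and only those with a high complexity score (4 or 5) were kept. The dataset covers multiple programming languages (such as Python, SQL, and JavaScript) and focuses on code generation and question-answering tasks, with the aim of improving the execution and refinement capabilities of code generation systems.
The above content was collected and summarized by 遇见数据集.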



