Evol-instruction-66k
Source: ModelScope community (updated 2026-05-23, indexed 2024-05-15)
Download link: https://modelscope.cn/datasets/codefuse-ai/Evol-instruction-66k
# Dataset Card for CodeFuse-Evol-instruction-66k
<div align='center'>
<p align="center">
<img src="https://modelscope.cn/api/v1/models/codefuse-ai/CodeFuse-QWen-14B/repo?Revision=master&FilePath=LOGO.jpg&View=true" width="800"/>
</p>
[[中文]](#chinese) [[English]](#english)
</div>
<a id="english"></a>
## Dataset Description
The Evol-instruction-66k dataset is built with the method described in the paper "WizardCoder: Empowering Code Large Language Models with Evol-Instruct". It adds complex code instructions to improve the fine-tuning of pre-trained code large language models.
The dataset is derived from the open-source dataset [Evol-Instruct-Code-80k-v1](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1). Processing consists of removing low-quality samples and removing samples that are similar to the HumanEval benchmark. Filtering the original 80k samples leaves 66k (66,862) high-quality samples for training and fine-tuning.
## Use
```python
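from modelscope import MsDataset

# Load the dataset from ModelScope and convert it to a Hugging Face dataset
dataset = MsDataset.load('codefuse-ai/Evol-instruction-66k').to_hf_dataset()
print(dataset)     # summary: features and number of rows
print(dataset[0])  # first record: a dict with 'instruction' and 'output'
"""
Dataset({
    features: ['instruction', 'output'],
    num_rows: 66862
})
...
"""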
```
## Creation
<strong>Filtering low-quality data</strong><br>
(1) Remove samples whose instruction is shorter than 10 words or longer than 1000 words;<br>
(2) Remove samples whose output is shorter than 50 words or longer than 2000 words;<br>
(3) Remove samples whose output contains no Markdown code block or more than one Markdown code block;<br>
(4) Remove samples whose Markdown code block contains fewer than 3 or more than 100 lines of code;<br>
(5) Remove samples in which the description preceding the Markdown code block exceeds 200 words.<br>
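
The five rules above can be sketched as a single predicate. This is a minimal illustration, not the authors' actual pipeline: it assumes that a "word" is a whitespace-separated token and that "markdown format" refers to a fenced code block. The names `word_count` and `keep_sample` are hypothetical.

```python
import re

# Three backticks, built programmatically so this example nests cleanly in Markdown.
FENCE = "`" * 3

# Matches one fenced code block and captures its body (assumed meaning of
# "markdown format" in the rules above).
FENCE_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of "word")."""
    return len(text.split())


def keep_sample(instruction: str, output: str) -> bool:
    """Return True if a sample passes rules (1)-(5) as interpreted here."""
    # (1) instruction length must be within 10..1000 words
    if not 10 <= word_count(instruction) <= 1000:
        return False
    # (2) output length must be within 50..2000 words
    if not 50 <= word_count(output) <= 2000:
        return False
    # (3) the output must contain exactly one fenced code block
    blocks = FENCE_RE.findall(output)
    if len(blocks) != 1:
        return False
    # (4) the code block must contain 3..100 lines
    code_lines = blocks[0].strip("\n").split("\n")
    if not 3 <= len(code_lines) <= 100:
        return False
    # (5) the text before the code block must not exceed 200 words
    preamble = output.split(FENCE, 1)[0]
    return word_count(preamble) <= 200
```

With the dataset loaded as a Hugging Face dataset, such a predicate could be applied through `dataset.filter(lambda r: keep_sample(r["instruction"], r["output"]))`.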
<strong>Filtering data similar to HumanEval</strong><br>
(1) Remove samples that contain any function name from HumanEval;<br>
(2) Use NLTK to strip stop words and punctuation from the HumanEval docstrings, leaving core words such as "sort array prime"; then remove samples that contain more than 40% of these core words.<br>
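
The two similarity rules can be sketched as follows. This is only an illustration under stated assumptions: a small hard-coded stop-word set stands in for the NLTK stop-word list so that the example runs without downloading corpora, the `BENCHMARK` entries are hypothetical stand-ins for HumanEval problems, and the 40% threshold is interpreted as coverage of a single problem's core words. The names `core_words` and `is_similar` are hypothetical.

```python
import string

# Stand-in for NLTK's English stop-word list (assumption: the real pipeline
# uses NLTK; this small set only keeps the example self-contained).
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "if", "to", "and", "or",
              "that", "it", "are"}

# Hypothetical HumanEval-style problems: function name -> docstring.
BENCHMARK = {
    "sort_array": "Return the given array sorted in ascending order.",
    "is_prime": "Return true if the given number is prime.",
}

# Translation table that replaces every punctuation character with a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def core_words(text: str) -> set:
    """Lowercase the text, strip punctuation, and drop stop words."""
    tokens = text.lower().translate(_PUNCT_TABLE).split()
    return {t for t in tokens if t not in STOP_WORDS}


def is_similar(sample_text: str, threshold: float = 0.4) -> bool:
    """Return True if the sample should be removed as benchmark-like."""
    # Rule (1): any benchmark function name appears in the sample.
    if any(name in sample_text for name in BENCHMARK):
        return True
    # Rule (2): the sample covers more than `threshold` of the core words
    # of at least one benchmark docstring.
    sample_words = core_words(sample_text)
    for docstring in BENCHMARK.values():
        core = core_words(docstring)
        if core and len(core & sample_words) / len(core) > threshold:
            return True
    return False
```

Comparing against each problem separately, rather than against the union of all core words, is a design choice made here for the sketch; the source does not specify which was used.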
## Structure
````
Instruction:
Execute a multistage computational task consisting of organizing the provided list of numerals in an ascending sequence and then determining the middle, or median, figure from the resultant ordered sequence.
data = [12, 30, 4, 18, 24]
Output:
Here's a Python solution:
```python
data = [12, 30, 4, 18, 24]
# organizing the list in ascending sequence
data.sort()
# Finding the middle figure from the sorted sequence
n = len(data)
if n % 2 == 0:
    median = (data[n//2 - 1] + data[n//2]) / 2
else:
    median = data[n//2]
print(f"Sorted Data: {data}")
print(f"Median: {median}")
```
First, we sort the list in ascending order using the `sort()` function. Then, we find the median. If the length of the list is even, the median is the average of the two middle numbers. If the length is odd, the median is the middle number. In this specific case, because the list length is odd (5), the middle value (i.e., the median) is the 3rd value in the sorted list.
````
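
Because each retained `output` contains exactly one fenced code block (see the filtering rules), a record can be split into its introductory text, code, and explanation. The helper below is a minimal sketch under that assumption; `split_output` is a hypothetical name, not part of the dataset tooling.

```python
# Three backticks, built programmatically so this example nests cleanly in Markdown.
FENCE = "`" * 3


def split_output(output: str) -> dict:
    """Split an `output` string into the parts around its single code block."""
    # Text before the opening fence, then everything after it.
    before, _, rest = output.partition(FENCE)
    # Fenced body up to the closing fence, then the trailing explanation.
    body, _, after = rest.partition(FENCE)
    # The first line of the fenced body is the language tag (e.g. "python").
    language, _, code = body.partition("\n")
    return {
        "preamble": before.strip(),
        "language": language.strip(),
        "code": code.rstrip("\n"),
        "explanation": after.strip(),
    }
```

For the sample shown above, `preamble` would hold "Here's a Python solution:", `language` would be "python", and `explanation` would hold the closing paragraph.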
<a id="chinese"></a>
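## Introduction
The Evol-instruction-66k dataset is built with the method described in the paper "WizardCoder: Empowering Code Large Language Models with Evol-Instruct". It adds complex code instructions to improve the fine-tuning of pre-trained code large language models.
The dataset is derived from the open-source dataset [Evol-Instruct-Code-80k-v1](https://huggingface.co/datasets/nickrosh/Evol-Instruct-Code-80k-v1) through a series of processing steps, including low-quality filtering and filtering of data similar to the HumanEval benchmark. Filtering the original 80k samples yields 66k high-quality samples for training and fine-tuning.
## Data Production Process
<strong>Filtering low-quality data</strong><br>
(1) Remove samples whose instruction is shorter than 10 words or longer than 1000 words;<br>
(2) Remove samples whose output is shorter than 50 words or longer than 2000 words;<br>
(3) Remove samples whose output contains no Markdown code block or more than one Markdown code block;<br>
(4) Remove samples whose Markdown code block contains fewer than 3 or more than 100 lines of code;<br>
(5) Remove samples in which the description preceding the Markdown code block exceeds 200 words.<br>
<strong>Filtering data similar to HumanEval</strong><br>
(1) Remove samples that contain any function name from HumanEval;<br>
(2) Use NLTK to strip stop words and punctuation from the HumanEval docstrings, leaving core words such as "sort array prime"; then remove samples that contain more than 40% of these core words.<br>
## Data Structure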
````
Instruction:
Execute a multistage computational task consisting of organizing the provided list of numerals in an ascending sequence and then determining the middle, or median, figure from the resultant ordered sequence.
data = [12, 30, 4, 18, 24]
Output:
Here's a Python solution:
```python
data = [12, 30, 4, 18, 24]
# organizing the list in ascending sequence
data.sort()
# Finding the middle figure from the sorted sequence
n = len(data)
if n % 2 == 0:
    median = (data[n//2 - 1] + data[n//2]) / 2
else:
    median = data[n//2]
print(f"Sorted Data: {data}")
print(f"Median: {median}")
```
First, we sort the list in ascending order using the `sort()` function. Then, we find the median. If the length of the list is even, the median is the average of the two middle numbers. If the length is odd, the median is the middle number. In this specific case, because the list length is odd (5), the middle value (i.e., the median) is the 3rd value in the sorted list.
````
Provider: maas
Created: 2023-09-11



