codeparrot/instructhumaneval

Name: codeparrot/instructhumaneval
Creator: codeparrot
Published: 2023-06-13 15:58:34
License: No description available

Hugging Face · Updated 2023-06-13 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/codeparrot/instructhumaneval

Resource description:

---
dataset_info:
  features:
  - name: task_id
    dtype: string
  - name: prompt
    dtype: string
  - name: canonical_solution
    dtype: string
  - name: test
    dtype: string
  - name: entry_point
    dtype: string
  - name: signature
    dtype: string
  - name: docstring
    dtype: string
  - name: context
    dtype: string
  - name: instruction
    dtype: string
  splits:
  - name: test
    num_bytes: 335913
    num_examples: 164
  download_size: 161076
  dataset_size: 335913
---

# Instruct HumanEval

## Summary

InstructHumanEval is a modified version of OpenAI HumanEval. For each prompt, we extracted its signature, its docstring, and its header to create a flexible setting for evaluating instruction-tuned LLMs. The delimiters used in the instruction-tuning procedure can be used to build an instruction that lets the model show its best capabilities. Here is an example of use.

The prompt can be built as follows, depending on the model's instruction-tuning delimiters:

```python
from datasets import load_dataset

# Load the single "test" split of the dataset.
ds = load_dataset("codeparrot/instructhumaneval", split="test")

# Wrap the instruction in the model's delimiters, then append the context
# so that the model continues from the start of the function.
prompt_0 = "Human:\n" + ds[0]["instruction"] + "\nAssistant:\n" + ds[0]["context"]
print(prompt_0)
```

Output

```
Human:
Write a function has_close_elements(numbers: List[float], threshold: float) -> bool to solve the following problem:
Check if in given list of numbers, are any two numbers closer to each other than given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True
Assistant:
from typing import List
def has_close_elements(numbers: List[float], threshold: float) -> bool:
```

The model can therefore complete the instruction and yield better results, because the prompt matches the format used during its training. You can also find the code to evaluate models on the dataset in the [BigCode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main). The following sections provide more details on the dataset.

## Dataset description

This dataset is a modified version of [OpenAI HumanEval](https://huggingface.co/datasets/openai_humaneval) designed to adapt the benchmark to instruction fine-tuned models. The original HumanEval evaluates the ability to complete a piece of code given its signature, its docstring, and potentially some auxiliary functions.

## Dataset construction

To build an instruction version of HumanEval, we extracted the relevant information from the **prompt** column of the original version:

- **signature**: the signature of the function to complete. It looks like `def function_name(args: type) -> return_type`.
- **docstring**: the docstring of the function, i.e. the text that describes the purpose of the function.
- **context**: any additional information provided to help the model complete the function. It includes the imports and the auxiliary functions.

Our idea was to move from the original format of HumanEval

```
<context>
<signature>
<docstring>
```

and build an **instruction** of the form

```
Write a function <signature> to solve the following problem:
<docstring>
```

From this instruction, we can design an evaluation pipeline for instruction fine-tuned language models.

## Evaluation

Instruction fine-tuned LLMs are built by fine-tuning a base LLM on an instruction dataset. This dataset contains several <input, output> pairs, each representing an instruction submitted by a user together with the right answer to it. These pairs are framed into a multi-turn conversation with the help of special tokens that designate each member of the interaction, e.g. a `user_token` `Human:`, an `assistant_token` `Assistant:`, and an `end_token` `\n` that marks the end of each turn.

### Code completion

In this case, the LLM is provided with the following prompt:

```
user_token + <instruction> + <end_token> + <assistant_token> + <context>
```

It is then expected to complete the function to solve the problem formulated by the `instruction`. This is very similar to the original evaluation, with the advantage that it puts the model in the best condition to understand the task it is asked to solve. The evaluation is done on the part generated after `<assistant_token>`.

### Docstring to code

This setting is more complicated, as it requires the model to account for the information contained in the instruction, such as the function signature. The LLM is provided with the following prompt:

```
user_token + <instruction> + <end_token> + <assistant_token>
```

The model has to generate a function with the correct signature that adequately solves the problem. The evaluation is done by identifying the content of the function in the generation (by searching for the right `entry_point`/`function_name`) and concatenating it with the `<context>` provided.

## How to use the dataset

```python
from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval")
```

```
ds
DatasetDict({
    test: Dataset({
        features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point', 'signature', 'docstring', 'context', 'instruction'],
        num_rows: 164
    })
})
```
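
The two prompt layouts from the Evaluation section amount to plain string concatenation. The sketch below is illustrative only: the function names and the default delimiter values (`"Human:\n"`, `"Assistant:\n"`, `"\n"`) are assumptions, and the right delimiters depend on how the evaluated model was instruction-tuned.

```python
# Hypothetical default delimiters; replace them with the ones used
# when the evaluated model was instruction-tuned.
USER_TOKEN = "Human:\n"
ASSISTANT_TOKEN = "Assistant:\n"
END_TOKEN = "\n"


def code_completion_prompt(instruction: str, context: str,
                           user_token: str = USER_TOKEN,
                           assistant_token: str = ASSISTANT_TOKEN,
                           end_token: str = END_TOKEN) -> str:
    """user_token + <instruction> + <end_token> + <assistant_token> + <context>"""
    return user_token + instruction + end_token + assistant_token + context


def docstring_to_code_prompt(instruction: str,
                             user_token: str = USER_TOKEN,
                             assistant_token: str = ASSISTANT_TOKEN,
                             end_token: str = END_TOKEN) -> str:
    """user_token + <instruction> + <end_token> + <assistant_token>"""
    return user_token + instruction + end_token + assistant_token


# Toy values standing in for one dataset row.
toy_instruction = "Write a function add(a: int, b: int) -> int to solve the following problem:\nAdd two integers."
toy_context = "import math\n"

completion_prompt = code_completion_prompt(toy_instruction, toy_context)
docstring_prompt = docstring_to_code_prompt(toy_instruction)
print(completion_prompt)
```

The only difference between the two settings is whether `<context>` is appended after the assistant delimiter, so the code-completion prompt always starts with the docstring-to-code prompt.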

Provider:

codeparrot

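Summary of original information

Instruct HumanEval dataset overview

Dataset description

InstructHumanEval is a modified version of OpenAI HumanEval, designed to suit instruction-tuned models. The dataset evaluates a model's ability to complete code given a function signature, a docstring, and possibly auxiliary functions.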

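Dataset structure

Features

task_id: string, the task identifier.
prompt: string, the original prompt.
canonical_solution: string, the canonical solution.
test: string, the test code.
entry_point: string, the entry point.
signature: string, the function signature.
docstring: string, the function docstring.
context: string, the context, including imports and auxiliary functions.
instruction: string, the instruction.

Splits

test: 164 examples, 335913 bytes in total.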

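Dataset construction

To build the instruction version of HumanEval, the following information was extracted from the prompt column of the original version:

signature: the function signature.
docstring: the function docstring.
context: the context, including imports and auxiliary functions.

Original format:

<context> <signature> <docstring>

Modified instruction format: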

Write a function <signature> to solve the following problem: <docstring>
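
As an illustration of the construction described above, here is a minimal sketch that splits a HumanEval-style prompt into the three fields and assembles the instruction. It is not the authors' extraction code: the helper name `split_prompt`, the use of Python's `ast` module (Python 3.9+ for `ast.unparse`), and the shortened example prompt are all assumptions. The sketch treats everything before the function definition as context. The example output in the dataset card shows the `def` header after the imports, so the real `context` column may also carry that header.

```python
import ast

# Toy prompt in the HumanEval style (shortened for illustration).
EXAMPLE_PROMPT = '''from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    """
'''


def split_prompt(prompt: str, entry_point: str) -> dict:
    """Split a prompt into context, signature, docstring, and instruction."""
    tree = ast.parse(prompt)
    # Locate the top-level function that the model is asked to complete.
    target = next(
        node for node in tree.body
        if isinstance(node, ast.FunctionDef) and node.name == entry_point
    )
    lines = prompt.splitlines()
    # Assumption: context is everything that precedes the target definition
    # (imports and auxiliary functions).
    context = "\n".join(lines[: target.lineno - 1]).strip()
    # Rebuild the signature from the parsed arguments and return annotation.
    signature = f"{target.name}({ast.unparse(target.args)})"
    if target.returns is not None:
        signature += f" -> {ast.unparse(target.returns)}"
    docstring = ast.get_docstring(target) or ""
    instruction = (
        f"Write a function {signature} to solve the following problem:\n{docstring}"
    )
    return {
        "context": context,
        "signature": signature,
        "docstring": docstring,
        "instruction": instruction,
    }


parts = split_prompt(EXAMPLE_PROMPT, "has_close_elements")
print(parts["instruction"])
```

Parsing with `ast` rather than regular expressions keeps the sketch robust to multi-line signatures, at the cost of requiring the prompt to be syntactically valid Python.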

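Evaluation

Instruction-tuned language models are built by fine-tuning a base language model on an instruction dataset. That dataset contains multiple <input, output> pairs, each representing an instruction submitted by a user together with its correct answer. These pairs are framed as a multi-turn conversation using special tokens (such as Human:, Assistant:, and the end token \n).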

Code completion

The model receives the following prompt:

user_token + <instruction> + <end_token> + <assistant_token> + <context>

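The model is expected to complete the function so that it solves the problem stated in the instruction. The evaluation is done on the part generated after <assistant_token>.

Docstring to code

The model receives the following prompt: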

user_token + <instruction> + <end_token> + <assistant_token>

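The model has to generate a function with the correct signature that adequately solves the problem. The evaluation is done by identifying the content of the function in the generation (by searching for the right entry_point/function_name) and concatenating it with the provided <context>.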
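
The docstring-to-code evaluation step can be sketched as follows, under stated assumptions. The helpers `extract_function` and `passes_tests` are hypothetical names, not part of any library. The extraction is a simple line scan that assumes a single-line `def` header, and the final `check(<entry_point>)` call follows the usual HumanEval convention in which the `test` column defines `check(candidate)`. The generation, context, and test below are toy values. Running `exec` on model output is unsafe outside a sandbox.

```python
from typing import Optional


def extract_function(generation: str, entry_point: str) -> Optional[str]:
    """Return the source of the function named entry_point, or None if absent."""
    lines = generation.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.startswith(f"def {entry_point}(")),
        None,
    )
    if start is None:
        return None
    body = [lines[start]]
    for line in lines[start + 1:]:
        # The function ends at the first non-blank line that is not indented.
        if line.strip() and not line[0].isspace():
            break
        body.append(line)
    return "\n".join(body).rstrip()


def passes_tests(context: str, function_src: str, test_src: str, entry_point: str) -> bool:
    """Concatenate context, function, and tests, then run them."""
    program = "\n\n".join([context, function_src, test_src, f"check({entry_point})"])
    try:
        exec(program, {})  # WARNING: only run untrusted code inside a sandbox.
    except Exception:
        return False
    return True


# Toy model output with chat text around the code.
generation = (
    "Sure, here is the function:\n\n"
    "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
    "    for i, a in enumerate(numbers):\n"
    "        for b in numbers[i + 1:]:\n"
    "            if abs(a - b) < threshold:\n"
    "                return True\n"
    "    return False\n\n"
    "Hope this helps!"
)
toy_context = "from typing import List"
toy_test = (
    "def check(candidate):\n"
    "    assert candidate([1.0, 2.0, 3.0], 0.5) is False\n"
    "    assert candidate([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True\n"
)

function_src = extract_function(generation, "has_close_elements")
passed = passes_tests(toy_context, function_src, toy_test, "has_close_elements")
print(passed)
```

Prepending the context matters here because the generated function uses `List`, which is only defined by the import carried in the context.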

How to use the dataset

from datasets import load_dataset

ds = load_dataset("codeparrot/instructhumaneval")

Example dataset structure:

DatasetDict({ test: Dataset({ features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point', 'signature', 'docstring', 'context', 'instruction'], num_rows: 164 }) })
