openai_humaneval

Name: openai_humaneval
Creator: maas
Published: 2025-12-03 17:25:52
License: No description provided

ModelScope Community. Updated: 2025-12-03. Indexed: 2025-01-11.

Download link:

https://modelscope.cn/datasets/openai-mirror/openai_humaneval

Description:

# Dataset Card for OpenAI HumanEval

## Table of Contents
- [OpenAI HumanEval](#openai-humaneval)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Repository:** [GitHub Repository](https://github.com/openai/human-eval)
- **Paper:** [Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374)

### Dataset Summary

The HumanEval dataset released by OpenAI contains 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten so that they would not appear in the training sets of code generation models.

### Supported Tasks and Leaderboards

### Languages

The programming problems are written in Python and contain English natural-language text in comments and docstrings.

## Dataset Structure

```python
from datasets import load_dataset

# Load the dataset and print its structure; it has a single "test" split.
dataset = load_dataset("openai_humaneval")
print(dataset)
# DatasetDict({
#     test: Dataset({
#         features: ['task_id', 'prompt', 'canonical_solution', 'test', 'entry_point'],
#         num_rows: 164
#     })
# })
```

### Data Instances

An example of a dataset instance:

```
{
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "canonical_solution": "    return 1",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1"
}
```

### Data Fields

- `task_id`: identifier for the data sample
- `prompt`: input for the model, containing the function header and docstring
- `canonical_solution`: reference solution for the problem in the `prompt`
- `test`: a function that tests generated code for correctness
- `entry_point`: name of the function that the test calls

### Data Splits

The dataset consists only of a test split with 164 samples.

## Dataset Creation

### Curation Rationale

Since code generation models are often trained on dumps of GitHub, a dataset not included in those dumps was necessary to evaluate such models properly. However, since this dataset was itself published on GitHub, it is likely to be included in future dumps.

### Source Data

The dataset was handcrafted by engineers and researchers at OpenAI.

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

[More Information Needed]

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

None.

## Considerations for Using the Data

Make sure you execute generated Python code in a safe environment when evaluating against this dataset, as generated code could be harmful.

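As an illustration of how the fields listed under Data Fields fit together, the following is a minimal sketch that joins `prompt`, a candidate completion, and `test`, then calls `check` on the `entry_point` in a separate Python process with a timeout. The helper names `build_program` and `run_isolated` are hypothetical and are not part of the dataset or of any official harness, and the sample simply mirrors the example instance shown under Data Instances. A child process with a timeout only limits runaway execution; it is not a sandbox, so real evaluations should still run inside a container or virtual machine.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def build_program(sample: dict, completion: str) -> str:
    """Join prompt, completion, and test, then call check() on the entry point."""
    return (
        sample["prompt"]
        + completion
        + "\n\n"
        + sample["test"]
        + "\n\n"
        + f"check({sample['entry_point']})\n"
    )


def run_isolated(program: str, timeout: float = 10.0) -> bool:
    """Run the program in a child Python process; True if it exits with status 0.

    Assumption: a subprocess plus timeout is only a first line of defence,
    not a real sandbox.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # Treat a hung candidate as a failure.
            return False
    return result.returncode == 0


# Sample mirroring the example instance from the dataset card.
sample = {
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "canonical_solution": "    return 1",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1",
}

# The canonical solution should pass; a deliberately wrong body should fail.
passed = run_isolated(build_program(sample, sample["canonical_solution"]))
failed = run_isolated(build_program(sample, "    return 2"))
print(passed, failed)
```

A failed assertion inside `check` makes the child process exit with a non-zero status, which is why the return code alone is enough to tell a passing candidate from a failing one in this sketch.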
### Social Impact of Dataset

With this dataset, code generation models can be evaluated more reliably, which should lead to fewer issues being introduced when such models are used.

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

OpenAI

### Licensing Information

MIT License

### Citation Information

```
@misc{chen2021evaluating,
      title={Evaluating Large Language Models Trained on Code},
      author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
      year={2021},
      eprint={2107.03374},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```

### Contributions

Thanks to [@lvwerra](https://github.com/lvwerra) for adding this dataset.

Provider: maas

Created: 2025-01-08
