bigcodebench

Name: bigcodebench
Creator: maas
Published: 2025-12-05 16:53:54
License: No description available

Source: ModelScope community. Updated 2025-12-05. Indexed 2025-07-12.

Download link:

https://modelscope.cn/datasets/bigcode/bigcodebench

Resource description:

# BigCodeBench

<center>
<img src="https://github.com/bigcode-bench/bigcode-bench.github.io/blob/main/asset/bigcodebench_banner.svg?raw=true" alt="BigCodeBench">
</center>

## Dataset Description

- **Homepage:** https://bigcode-bench.github.io/
- **Repository:** https://github.com/bigcode-project/bigcodebench
- **Paper:** [Link](https://arxiv.org/abs/2406.15877)
- **Point of Contact:** contact@bigcode-project.org, terry.zhuo@monash.edu

The dataset has 2 variants:

1. `BigCodeBench-Complete`: _Code Completion based on the structured docstrings_.
2. `BigCodeBench-Instruct`: _Code Generation based on the NL-oriented instructions_.

The overall statistics of the dataset are as follows:

| | Complete | Instruct |
|-|-|-|
| # Task | 1140 | 1140 |
| # Avg. Test Cases | 5.6 | 5.6 |
| # Avg. Coverage | 99% | 99% |
| # Avg. Prompt Char. | 1112.5 | 663.2 |
| # Avg. Prompt Line | 33.5 | 11.7 |
| # Avg. Prompt Char. (Code) | 1112.5 | 124.0 |
| # Avg. Solution Char. | 426.0 | 426.0 |
| # Avg. Solution Line | 10.0 | 10.0 |
| # Avg. Solution Cyclomatic Complexity | 3.1 | 3.1 |

The function-calling (tool use) statistics of the dataset are as follows:

| | Complete/Instruct |
|-|-|
| # Domain | 7 |
| # Standard Library | 77 |
| # 3rd Party Library | 62 |
| # Standard Function Call | 281 |
| # 3rd Party Function Call | 116 |
| # Avg. Task Library | 2.8 |
| # Avg. Task Fun Call | 4.7 |
| # Library Combo | 577 |
| # Function Call Combo | 1045 |
| # Domain Combo | 56 |

### Changelog

| Release | Description |
|-|-|
| v0.1.0 | Initial release of BigCodeBench |

### Dataset Summary

BigCodeBench is an *__easy-to-use__* benchmark which evaluates LLMs with *__practical__* and *__challenging__* programming tasks. The dataset was created as part of the [BigCode Project](https://www.bigcode-project.org/), an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs).
BigCodeBench is intended as a fundamental benchmark for LLMs rather than for *LLM Agents*. That is, it targets code-generating AI systems that synthesize programs from natural language descriptions or from code snippets.

### Languages

The dataset contains only English natural language and Python (3.0+) code.

### How to use it

```python
from datasets import load_dataset

# Full dataset
ds = load_dataset("bigcode/bigcodebench", split="v0.1.4")

# Dataset streaming (will only download the data as needed)
ds = load_dataset("bigcode/bigcodebench", streaming=True, split="v0.1.4")
for sample in ds:
    print(sample)
```

## Dataset Structure

### Data Fields

* `task_id` (`string`): The unique identifier for the task.
* `complete_prompt` (`string`): The PEP257-structured docstring prompt.
* `instruct_prompt` (`string`): The natural-language-oriented instruction prompt.
* `canonical_solution` (`string`): The canonical solution without comments.
* `code_prompt` (`string`): The code-only prompt.
* `test` (`string`): The code snippet for testing, wrapped in a `unittest.TestCase` class.
* `entry_point` (`string`): The entry point for the code snippet, which is `task_func`.
* `doc_struct` (`string[dictionary]`): The structured docstring.
  * `description` (`string`): The main task description in natural language.
  * `note` (`string`): Additional notes for the task in natural language.
  * `reqs` (`string`, `optional`): The modules that can be used in the task solution.
  * `params` (`string`, `optional`): The parameters used in the task solution.
  * `returns` (`string`, `optional`): The values to be returned by the task solution.
  * `raises` (`string`, `optional`): The exceptions that should be raised in the task solution.
  * `examples` (`string`, `optional`): Interactive Python examples given as hints for the task solution.
* `libs` (`string`): The libraries that can be used in the task solution.
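The fields above can be combined to check a solution locally. The following is a minimal sketch rather than the project's evaluation harness: the record is a made-up stand-in that only mimics the documented field names, `run_record` is a hypothetical helper, and it assumes that `code_prompt` followed by a solution body forms a complete program defining `entry_point`. It also uses `exec`, which is unsafe for untrusted model output outside a sandbox.

```python
import unittest

# Hypothetical record mimicking the documented fields; NOT a real dataset task.
sample = {
    "task_id": "Example/0",
    "code_prompt": "def task_func(values):\n",
    "canonical_solution": "    return sum(values) / len(values)\n",
    "test": (
        "import unittest\n"
        "class TestCases(unittest.TestCase):\n"
        "    def test_mean(self):\n"
        "        self.assertEqual(task_func([1, 2, 3]), 2)\n"
        "    def test_single(self):\n"
        "        self.assertEqual(task_func([4]), 4)\n"
    ),
    "entry_point": "task_func",
}


def is_test_case(obj):
    """True for unittest.TestCase subclasses (excluding TestCase itself)."""
    return (
        isinstance(obj, type)
        and issubclass(obj, unittest.TestCase)
        and obj is not unittest.TestCase
    )


def run_record(record, solution_body=None):
    """Run a record's tests; return (tests_run, tests_not_passed)."""
    body = record["canonical_solution"] if solution_body is None else solution_body
    namespace = {"__name__": "candidate"}
    # Assumption: code_prompt + body is a complete program.
    exec(record["code_prompt"] + body, namespace)
    if record["entry_point"] not in namespace:
        raise ValueError("solution does not define the entry point")
    # Run the test code in the same namespace so it can see the entry point.
    exec(record["test"], namespace)
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for obj in list(namespace.values()):
        if is_test_case(obj):
            suite.addTests(loader.loadTestsFromTestCase(obj))
    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun, len(result.failures) + len(result.errors)
```

Calling `run_record(sample)` runs both tests against the stand-in canonical solution, while passing a different `solution_body` shows how a candidate completion would be scored against the same `test` field. For real evaluation, the `bigcodebench` framework linked in the repository is the appropriate tool.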
### Data Splits

The dataset has no splits, and all data is loaded as the train split by default.

## Dataset Creation

For more information on the dataset construction, please refer to the [technical report](https://huggingface.co/papers/). GitHub Action pipeline code is available [here](https://github.com/bigcode-project/bigcodebench-annotation).

### Curation Rationale

We believe that there are three main expectations of a good execution-based programming benchmark:

1. The benchmark should be easy to use and efficient in evaluating the fundamental capabilities of LLMs. Repo-level and agent-centric benchmarks (e.g., SWE-bench) are not suitable for this purpose.
2. The benchmark should be practical, covering various programming scenarios. Algo-specific benchmarks (e.g., HumanEval and MBPP) are unsuitable. Domain-specific benchmarks (e.g., DS-1000) are also unsuitable for this purpose.
3. The benchmark should be challenging, with tasks that require strong compositional reasoning and instruction-following capabilities from LLMs. Benchmarks with simple tasks (e.g., ODEX) are unsuitable.

BigCodeBench is the first benchmark that meets all three expectations. It is an easy-to-use benchmark that evaluates LLMs with challenging and practical programming tasks, accompanied by an end-to-end evaluation framework [`bigcodebench`](https://github.com/bigcode-project/bigcodebench). We aim to assess how well LLMs can solve practical and challenging programming tasks in an open-ended setting.

### Source Data

#### Data Collection

For the dataset construction, please refer to Section 2 in the [technical report](https://huggingface.co/papers/).

#### Who are the source language producers?

The data was originally sourced from GPT-4-0613, with the seed examples from [ODEX](https://github.com/zorazrw/odex) (collected from StackOverflow). The data was then annotated through collaboration between human experts and LLMs.
## Considerations for Using the Data

### Discussion of Biases

We acknowledge that a few programming tasks may have slightly biased instructions or over-specific test cases. Considering that software development is iterative, incremental, and collaborative, we believe that the bias can be mitigated with the long-term development of BigCodeBench and additional help from the open-source community. We are open to feedback and suggestions for improving the dataset.

### Other Known Limitations

See Appendix D in the [technical report](https://huggingface.co/papers/) for more information. We highlight a few limitations as follows:

* Multilingualism
* Saturation
* Reliability
* Efficiency
* Rigorousness
* Generalization
* Evolution
* Interaction

## Additional Information

### Dataset Curators

1. Terry Yue Zhuo, Monash University & CSIRO's Data61, terry.zhuo@monash.edu
2. Other BigCode Project members include
   * Minh Chien Vu
   * Jenny Chim
   * Han Hu
   * Wenhao Yu
   * Ratnadira Widyasari
   * Imam Nur Bani Yusuf
   * Haolan Zhan
   * Junda He
   * Indraneil Paul
   * Simon Brunner
   * Chen Gong
   * Thong Hoang
   * Armel Randy Zebaze
   * Xiaoheng Hong
   * Wen-Ding Li
   * Jean Kaddour
   * Ming Xu
   * Zhihan Zhang
   * Prateek Yadav
   * Niklas Muennighoff

### Licensing Information

BigCodeBench is licensed under the Apache License, Version 2.0.

### Citation Information

```bibtex
@article{zhuo2024bigcodebench,
  title={BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions},
  author={Zhuo, Terry Yue and Vu, Minh Chien and Chim, Jenny and Hu, Han and Yu, Wenhao and Widyasari, Ratnadira and Yusuf, Imam Nur Bani and Zhan, Haolan and He, Junda and Paul, Indraneil and others},
  journal={arXiv preprint arXiv:2406.15877},
  year={2024}
}
```

Provider: maas

Created: 2025-10-11
