mbpp

Name: mbpp
Creator: maas
Published: 2026-05-23 23:53:07
License: no description provided

ModelScope community · updated 2026-05-23 · indexed 2024-05-15

Download link:

https://modelscope.cn/datasets/google-research-datasets/mbpp

Resource description:

# Dataset Card for Mostly Basic Python Problems (mbpp)

## Table of Contents
- [Dataset Card for Mostly Basic Python Problems (mbpp)](#dataset-card-for-mostly-basic-python-problems-(mbpp))
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Repository:** https://github.com/google-research/google-research/tree/master/mbpp
- **Paper:** [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732)

### Dataset Summary

The benchmark consists of around 1,000 crowd-sourced Python programming problems, designed to be solvable by entry level programmers, covering programming fundamentals, standard library functionality, and so on.
Each problem consists of a task description, a code solution, and 3 automated test cases. As described in the paper, a subset of the data has been hand-verified by us.

Released [here](https://github.com/google-research/google-research/tree/master/mbpp) as part of [Program Synthesis with Large Language Models, Austin et al., 2021](https://arxiv.org/abs/2108.07732).

### Supported Tasks and Leaderboards

This dataset is used to evaluate code generation.

### Languages

English - Python code

## Dataset Structure

```python
from datasets import load_dataset

# Full version; the printed structure is shown in the comment below.
dataset_full = load_dataset("mbpp")
# DatasetDict({
#     test: Dataset({
#         features: ['task_id', 'text', 'code', 'test_list', 'test_setup_code', 'challenge_test_list'],
#         num_rows: 974
#     })
# })

# Sanitized version; the printed structure is shown in the comment below.
dataset_sanitized = load_dataset("mbpp", "sanitized")
# DatasetDict({
#     test: Dataset({
#         features: ['source_file', 'task_id', 'prompt', 'code', 'test_imports', 'test_list'],
#         num_rows: 427
#     })
# })
```

### Data Instances

#### mbpp - full

```
{
    'task_id': 1,
    'text': 'Write a function to find the minimum cost path to reach (m, n) from (0, 0) for the given cost matrix cost[][] and a position (m, n) in cost[][].',
    'code': 'R = 3\r\nC = 3\r\ndef min_cost(cost, m, n): \r\n\ttc = [[0 for x in range(C)] for x in range(R)] \r\n\ttc[0][0] = cost[0][0] \r\n\tfor i in range(1, m+1): \r\n\t\ttc[i][0] = tc[i-1][0] + cost[i][0] \r\n\tfor j in range(1, n+1): \r\n\t\ttc[0][j] = tc[0][j-1] + cost[0][j] \r\n\tfor i in range(1, m+1): \r\n\t\tfor j in range(1, n+1): \r\n\t\t\ttc[i][j] = min(tc[i-1][j-1], tc[i-1][j], tc[i][j-1]) + cost[i][j] \r\n\treturn tc[m][n]',
    'test_list': [
        'assert min_cost([[1, 2, 3], [4, 8, 2], [1, 5, 3]], 2, 2) == 8',
        'assert min_cost([[2, 3, 4], [5, 9, 3], [2, 6, 4]], 2, 2) == 12',
        'assert min_cost([[3, 4, 5], [6, 10, 4], [3, 7, 5]], 2, 2) == 16'],
    'test_setup_code': '',
    'challenge_test_list': []
}
```

#### mbpp - sanitized

```
{
    'source_file': 'Benchmark Questions Verification V2.ipynb',
    'task_id': 2,
    'prompt': 'Write a function to find the shared elements from the given two lists.',
    'code': 'def similar_elements(test_tup1, test_tup2):\n res = tuple(set(test_tup1) & set(test_tup2))\n return (res) ',
    'test_imports': [],
    'test_list': [
        'assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))',
        'assert set(similar_elements((1, 2, 3, 4),(5, 4, 3, 7))) == set((3, 4))',
        'assert set(similar_elements((11, 12, 14, 13),(17, 15, 14, 13))) == set((13, 14))'
    ]
}
```

### Data Fields

- `source_file`: unknown
- `text`/`prompt`: description of the programming task
- `code`: solution for the programming task
- `test_setup_code`/`test_imports`: code imports needed to execute the tests
- `test_list`: list of tests to verify the solution
- `challenge_test_list`: list of more challenging tests to further probe the solution

### Data Splits

There are two versions of the dataset (full and sanitized), each with four splits:

- train
- evaluation
- test
- prompt

The `prompt` split corresponds to samples used for few-shot prompting and not for training.

## Dataset Creation

See Section 2.1 of the original [paper](https://arxiv.org/abs/2108.07732).

### Curation Rationale

Evaluating code generation requires a set of simple programming tasks together with their solutions, which this dataset provides.

### Source Data

#### Initial Data Collection and Normalization

The dataset was manually created from scratch.

#### Who are the source language producers?

The dataset was created with an internal crowdsourcing effort at Google.

### Annotations

#### Annotation process

The full dataset was created first, and a subset then underwent a second round to improve the task descriptions.

#### Who are the annotators?

The dataset was created with an internal crowdsourcing effort at Google.

### Personal and Sensitive Information

None.

## Considerations for Using the Data

Make sure you execute generated Python code in a safe environment when evaluating against this dataset, as generated code could be harmful.
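As a minimal sketch of that precaution, the snippet below joins a candidate solution with the asserts from `test_list` and runs them in a separate Python process with a timeout. The helper name `passes_tests` and its parameters (`setup_code`, `timeout_s`) are hypothetical and not part of the dataset or any library. A child process with a timeout only limits runaway loops and isolates crashes; it is not a sandbox, so untrusted code should still run inside a container or VM.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(code, test_list, setup_code="", timeout_s=10.0):
    """Return True if `code` passes every assert in `test_list`.

    Hypothetical helper: the program runs in a child interpreter so a crash
    or infinite loop cannot take down the evaluating process.
    """
    # Setup code first, then the solution, then one assert per line.
    program = "\n".join([setup_code, code, *test_list]) + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp,              # keep stray file writes in the temp dir
                capture_output=True,  # do not echo the candidate's output
                timeout=timeout_s,    # kill the child if it runs too long
            )
        except subprocess.TimeoutExpired:
            return False
    # A failed assert or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0


# Illustrative record shaped like the sanitized example under Data Instances.
example = {
    "code": (
        "def similar_elements(test_tup1, test_tup2):\n"
        "    return tuple(set(test_tup1) & set(test_tup2))\n"
    ),
    "test_imports": [],
    "test_list": [
        "assert set(similar_elements((3, 4, 5, 6),(5, 7, 4, 10))) == set((4, 5))",
    ],
}

if __name__ == "__main__":
    setup = "\n".join(example["test_imports"])
    print(passes_tests(example["code"], example["test_list"], setup))
```

For the full version, the `test_setup_code` string can be passed as `setup_code` directly; for the sanitized version, the `test_imports` list is joined into one string first, as in the usage lines above.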
### Social Impact of Dataset

With this dataset, code-generating models can be better evaluated, which leads to fewer issues being introduced when such models are used.

### Discussion of Biases

### Other Known Limitations

The task descriptions might not be expressive enough to solve the task. The `sanitized` version aims to address this issue through a second round of annotation that improved the task descriptions.

## Additional Information

### Dataset Curators

Google Research

### Licensing Information

CC-BY-4.0

### Citation Information

```
@article{austin2021program,
  title={Program Synthesis with Large Language Models},
  author={Austin, Jacob and Odena, Augustus and Nye, Maxwell and Bosma, Maarten and Michalewski, Henryk and Dohan, David and Jiang, Ellen and Cai, Carrie and Terry, Michael and Le, Quoc and others},
  journal={arXiv preprint arXiv:2108.07732},
  year={2021}
}
```

### Contributions

Thanks to [@lvwerra](https://github.com/lvwerra) for adding this dataset.


Provider:

maas

Created:

2025-07-07
