codeparrot/apps | Natural language processing dataset | Code generation dataset

Source: hugging_face · Updated 2022-10-20 · Indexed 2024-03-04

Natural language processing

Code generation

Download link:

https://hf-mirror.com/datasets/codeparrot/apps


Resource description:

---
annotations_creators: []
language_creators:
- crowdsourced
- expert-generated
language: ["code"]
license:
- mit
multilinguality:
- monolingual
pretty_name: APPS
size_categories:
- unknown
source_datasets: []
task_categories:
- text-generation
task_ids:
- language-modeling
---

# APPS Dataset

## Dataset Description

[APPS](https://arxiv.org/abs/2105.09938) is a benchmark for code generation with 10000 problems. It can be used to evaluate the ability of language models to generate code from natural language specifications.
The **APPS metric** is also available on the Hub at [codeparrot/apps_metric](https://huggingface.co/spaces/codeparrot/apps_metric).

## Languages

The dataset contains questions in English and code solutions in Python.

## Dataset Structure

```python
from datasets import load_dataset
load_dataset("codeparrot/apps")

DatasetDict({
    train: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['problem_id', 'question', 'solutions', 'input_output', 'difficulty', 'url', 'starter_code'],
        num_rows: 5000
    })
})
```

### How to use it

You can load the train split and read its first sample with the following code:

```python
from datasets import load_dataset
import json

ds = load_dataset("codeparrot/apps", split="train")
sample = next(iter(ds))
# non-empty solutions and input_output features can be parsed from text format this way:
sample["solutions"] = json.loads(sample["solutions"])
sample["input_output"] = json.loads(sample["input_output"])
print(sample)

#OUTPUT:
{
 'problem_id': 0,
 'question': 'Polycarp has $n$ different binary words. A word called binary if it contains only characters \'0\' and \'1\'. For example...',
 'solutions': ["for _ in range(int(input())):\n n = int(input())\n mass = []\n zo = 0\n oz = 0\n zz = 0\n oo = 0\n...",...],
 'input_output': {'inputs': ['4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n'],
                  'outputs': ['1\n3 \n-1\n0\n\n2\n1 2 \n']},
 'difficulty': 'interview',
 'url': 'https://codeforces.com/problemset/problem/1259/D',
 'starter_code': ''
}
```

Each sample consists of a programming problem formulation in English, some ground-truth Python solutions, test cases defined by their inputs and outputs (plus a function name, if provided), and metadata on the difficulty level of the problem and its source.

If a sample has a non-empty `input_output` feature, you can read it as a dictionary with the keys `inputs` and `outputs`, plus `fn_name` if it exists. Similarly, you can parse `solutions` into a list of solutions, as shown in the code above.

You can also filter the dataset by difficulty level: Introductory, Interview and Competition. Pass the difficulty levels you want as a list. For example, to get the most challenging problems, select the competition level:

```python
ds = load_dataset("codeparrot/apps", split="train", difficulties=["competition"])
print(next(iter(ds))["question"])

#OUTPUT:
"""\
Codefortia is a small island country located somewhere in the West Pacific. It consists of $n$ settlements connected by
...

For each settlement $p = 1, 2, \dots, n$, can you tell what is the minimum time required to travel between the king's residence and the parliament house (located in settlement $p$) after some roads are abandoned?

-----Input-----

The first line of the input contains four integers $n$, $m$, $a$ and $b$
...

-----Output-----

Output a single line containing $n$ integers
...

-----Examples-----
Input
5 5 20 25
1 2 25
...

Output
0 25 60 40 20
...
"""
```

### Data Fields

|Field|Type|Description|
|---|---|---|
|problem_id|int|problem id|
|question|string|problem description|
|solutions|string|some Python solutions|
|input_output|string|JSON string with the "inputs" and "outputs" of the test cases; may also include "fn_name", the name of the function|
|difficulty|string|difficulty level of the problem|
|url|string|URL of the source of the problem|
|starter_code|string|starter code to include in prompts|

Note that only a few samples have `fn_name` and `starter_code` specified.

### Data Splits

The dataset contains train and test splits with 5000 samples each.

### Dataset Statistics

* 10000 coding problems
* 131777 test cases
* all problems have at least one test case except 195 samples in the train split
* for the test split, the average number of test cases is 21.2
* average length of a problem is 293.2 words
* all files have ground-truth solutions except 1235 samples in the test split

## Dataset Creation

To create the APPS dataset, the authors manually curated problems from open-access sites where programmers share problems with each other, including Codewars, AtCoder, Kattis, and Codeforces. For more details please refer to the original [paper](https://arxiv.org/pdf/2105.09938.pdf).

## Considerations for Using the Data

In [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf) the authors found that this dataset can generate many false positives during evaluation, where incorrect submissions are marked as correct due to lack of test coverage.

## Citation Information

```
@article{hendrycksapps2021,
  title={Measuring Coding Challenge Competence With APPS},
  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}
```

Provided by:

codeparrot

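Summary of original information

Dataset overview

Dataset name

APPS

Dataset description

APPS is a code generation benchmark containing 10,000 programming problems. It is used to evaluate how well language models generate code from natural language specifications.

Languages

The dataset contains problems in English and code solutions in Python.

Dataset structure

The dataset is split into a training set and a test set of 5,000 samples each.
Each sample has the following features:
- problem_id: problem ID (integer)
- question: problem description (string)
- solutions: Python solutions (string; parses as a JSON list when non-empty)
- input_output: test-case inputs and outputs (JSON string; may include the function name)
- difficulty: difficulty level of the problem (string)
- url: URL of the problem source (string)
- starter_code: starter code to include in prompts (string)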
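Because `solutions` and `input_output` are stored as strings, they have to be decoded before use. The following is a minimal sketch of one way to do that. It assumes that samples without solutions or test cases carry an empty string in those fields; the helper name `parse_sample` and the hand-written record are illustrative and not part of the dataset or the `datasets` library.

```python
import json
from typing import Any


def parse_sample(sample: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `sample` with its JSON-encoded fields decoded.

    Assumption: missing solutions / test cases appear as empty strings,
    which json.loads would reject, so they map to [] and {} instead.
    """
    parsed = dict(sample)
    raw_solutions = sample.get("solutions") or ""
    raw_io = sample.get("input_output") or ""
    parsed["solutions"] = json.loads(raw_solutions) if raw_solutions.strip() else []
    parsed["input_output"] = json.loads(raw_io) if raw_io.strip() else {}
    return parsed


# Hand-written record shaped like a dataset row (illustrative, not real data).
example = {
    "problem_id": 0,
    "question": "Read an integer and print its double.",
    "solutions": json.dumps(["print(2 * int(input()))"]),
    "input_output": json.dumps({"inputs": ["2\n"], "outputs": ["4\n"]}),
    "difficulty": "introductory",
    "url": "",
    "starter_code": "",
}

parsed = parse_sample(example)
print(len(parsed["solutions"]))              # 1
print(parsed["input_output"].get("fn_name"))  # None: this record has no function name
```

Using `.get("fn_name")` rather than indexing reflects that only a few samples specify a function name.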

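Dataset statistics

Total problems: 10,000
Test cases: 131,777
Samples in the training set with no test cases: 195
Average number of test cases per sample in the test set: 21.2
Average length of a problem description: 293.2 words
Samples in the test set with no solutions: 1,235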

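Dataset creation

The authors manually curated the dataset from several open-access sites where programmers share problems, including Codewars, AtCoder, Kattis, and Codeforces.

Usage considerations

Evaluation on this dataset can produce many false positives: because test coverage is limited, incorrect submissions may be marked as correct.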
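The effect of limited test coverage can be illustrated with a small sketch. It is a simplification: each candidate solution is modeled as a plain Python callable from input text to output text, whereas real APPS solutions are programs or functions run by an evaluation harness. The names `passes_all` and `buggy_is_even` and the test cases are hypothetical.

```python
from typing import Callable


def passes_all(solve: Callable[[str], str], inputs: list[str], outputs: list[str]) -> bool:
    """Return True if `solve` reproduces every expected output.

    Leading and trailing whitespace is ignored when comparing.
    """
    return all(solve(i).strip() == o.strip() for i, o in zip(inputs, outputs))


def buggy_is_even(stdin: str) -> str:
    # Deliberately wrong: answers "YES" for every input.
    return "YES\n"


# Hypothetical test cases for "print YES if n is even, otherwise NO".
sparse_inputs, sparse_outputs = ["4\n"], ["YES\n"]
fuller_inputs, fuller_outputs = ["4\n", "7\n"], ["YES\n", "NO\n"]

print(passes_all(buggy_is_even, sparse_inputs, sparse_outputs))  # True: a false positive
print(passes_all(buggy_is_even, fuller_inputs, fuller_outputs))  # False: the bug is caught
```

With a single test case the incorrect solution is accepted; adding one more case exposes it. Note also that a check of this form accepts anything when the test list is empty.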

Citation information

@article{hendrycksapps2021,
  title={Measuring Coding Challenge Competence With APPS},
  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}

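AI-collected summary

Dataset introduction

Construction

APPS was built by manually curating programming problems from open-access programming sites, including Codewars, AtCoder, Kattis, and Codeforces, yielding a dataset of 10,000 problems. Each problem comes with an English description, Python solutions, test cases, and metadata such as the difficulty level and a link to the source.

Characteristics

The dataset is monolingual: problem descriptions are in English and solutions are Python code. It is split into a training set and a test set of 5,000 samples each. Besides the problem and its solutions, each sample includes test-case inputs and outputs and, where available, a function name, which makes the dataset a benchmark for evaluating how well language models generate code.

Usage

The whole dataset can be loaded with the Hugging Face datasets library. Samples can be accessed by iteration, and the solutions and test-case inputs and outputs in each sample can be parsed from their string form. The dataset can also be filtered by difficulty level to target different tiers of programming challenges.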
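Filtering by difficulty can also be done after loading, on the rows themselves. The sketch below works on any iterable of row dictionaries; the helper name `filter_by_difficulty` and the three rows are illustrative only, and the comparison is case-insensitive on the assumption that the stored values are lowercase labels such as `interview`.

```python
from typing import Any, Iterable, Iterator


def filter_by_difficulty(
    samples: Iterable[dict[str, Any]], levels: Iterable[str]
) -> Iterator[dict[str, Any]]:
    """Yield only the samples whose `difficulty` is one of `levels`."""
    wanted = {level.lower() for level in levels}
    for sample in samples:
        if str(sample.get("difficulty", "")).lower() in wanted:
            yield sample


# Hand-written rows carrying only the fields needed here (not real data).
rows = [
    {"problem_id": 0, "difficulty": "interview"},
    {"problem_id": 1, "difficulty": "competition"},
    {"problem_id": 2, "difficulty": "introductory"},
]

hard = list(filter_by_difficulty(rows, ["competition"]))
print([row["problem_id"] for row in hard])  # [1]
```

On a loaded `datasets.Dataset`, the same predicate can be passed to `Dataset.filter`, for example `ds.filter(lambda row: row["difficulty"] == "competition")`.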

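Background and challenges

Background

APPS, short for Automated Programming Progress Standard, was created in 2021 by Dan Hendrycks and colleagues. It is designed to evaluate how well language models generate code and contains 10,000 programming problems ranging from introductory to competition level. The problems come mainly from open-access programming sites such as Codewars and AtCoder. Its release gave code generation research a new benchmark and a reference point for work on automated programming.

Current challenges

On the research side, the main challenge is assessing the quality of model-generated code accurately, since limited test coverage readily produces false positives. On the construction side, challenges include manually curating and validating problems and ensuring that the problems and solutions provided are accurate and diverse. Balancing the distribution of problems across difficulty levels is a further challenge.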

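Common scenarios

Typical use

As a code generation benchmark, APPS is mainly used to evaluate how well language models generate code from natural language specifications. It provides problem descriptions with corresponding Python solutions, along with test-case inputs and outputs, for use in evaluation.

Academic problems addressed

APPS addresses the question of how to measure the performance of machine learning models on code generation, giving researchers a standardized evaluation method. Its problems span difficulty levels from introductory to competition, so models can be trained and evaluated on programming tasks of different levels.

Related work

APPS has been used in later work such as AlphaCode, which trained and evaluated models on the dataset and advanced research on automated programming.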

The content above was collected and summarized by AI.
