NeuroDragon/BuggedPythonLeetCode

Name: NeuroDragon/BuggedPythonLeetCode
Creator: NeuroDragon
Published: 2023-07-24 11:05:59
License: 暂无描述

Hugging Face2023-07-24 更新2024-03-04 收录

下载链接：

https://hf-mirror.com/datasets/NeuroDragon/BuggedPythonLeetCode

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - text-generation - question-answering language: - en tags: - code size_categories: - 10K<n<100K --- # Dataset Description edit: fixed some bugs with datasets not handling all pyarrow types. ## Dataset Summary This dataset consists of Python coding problems from LeetCode, which have been bugged using the [OpenBugger](https://github.com/furlat/OpenBugger) package. This dataset provides a unique opportunity to study the debugging process in a controlled and replicable environment. For each correct code snippet, 15 bugged versions were attempted. For each succesfully bugged version, a corresponding question mimicking a beginner coder's perspective was generated, creating a Q/A pair. In addition to the code and question, each data entry contains the task description, the bug's location, and debugging instructions. Finally the code snippets are wrapped in python markdown headers and the conversation is structured using the [ChatML](https://github.com/openai/openai-python/blob/main/chatml.md) format. Highest quality data for training LLM are found in https://huggingface.co/datasets/NeuroDragon/BuggedPythonLeetCode/blob/main/train/bugged_leetcode_no_replaced.parquet ## Supported Tasks This dataset supports a variety of tasks: - Code Debugging: Predict the correct code snippet given the bugged code and the question. - Question Answering: Given the task description and bugged code, generate the question. - Code Generation: Given the task description, question, and debugging instructions, generate the correct code. - CST Generation: Given the code, generate the concrete syntax tree. ## Languages The text in the dataset is in English, and the code is in Python. # Dataset Structure ## Data Instances The core of the dataset is constructed around the following concepts, refer to the dataframes columns headers for the specific names: - correct_code: The original, correct Python code snippet. - task_description: A brief description of the coding task. - bugged_code: The bugged version of the correct code. - bugs_location: The location(s) of the bug(s) in the code. - debugging_instructions: Instructions to help debug the code. - question: A question related to the bugged code, mimicking a beginner's query. - answer: The answer to the question, which alawys contain the original correct code or a chunk of GPT-generated code that matches the original up to linting and comments. ## Data Splits The dataset is split into five files: - bugged_leetcode_all_conversations_with_embeddings.parquet - bugged_leetcode_all_conversations.parquet - bugged_leetcode_no_replaced_with_embeddings.parquet - bugged_leetcode_no_replaced.parquet - bugged_leetcode_all_steps.parquet ## Data Generation Process The data was generated using a combination of LeetCode problem data, the OpenBugger package, and the GPT 3.5 model. The original code was bugged using OpenBugger, and then the GPT model was used to generate a question and answer based on the bugged code and task description in order to limit GPT contribution to the natural language and not the coding aspect of the dataset. Additional processing ensured that the final answer was a compilable Python code and that corresponded to the original leetcode solution. # Dataset Creation ## Curation Rationale This dataset was curated to provide a large-scale, diverse set of Python programming problems and their bugged versions, which could be used for developing and evaluating models for debugging, code generation, and question answering. ## Dataset Source The original coding problems were sourced from [leetcode-solutions-python](https://huggingface.co/datasets/mhhmm/leetcode-solutions-python ). ## Licensing Information Please refer to the licensing information of the original dataset. # Dataset Usage ## Usage Caveats Users should be aware that the questions in this dataset contain some stereotypical phrases, and may benefit from checking for n-gram distributions and filtering the spikes. Multiple post-processing steps have already been applied, but better safe than sorry. # Dataset Maintenance ## Contact Information Please contact the [original author](https://github.com/furlat) for any questions or concerns related to this dataset. ## Dataset Updates This is a static dataset that does not receive updates.

提供机构：

NeuroDragon

原始信息汇总

数据集描述

数据集概述

该数据集包含来自LeetCode的Python编程问题，这些问题已经使用OpenBugger包进行了错误注入。数据集提供了一个在受控和可复制环境中研究调试过程的独特机会。每个正确的代码片段尝试了15个错误版本，并为每个成功错误版本生成一个模仿初学者编码者视角的问题，形成了一个问答对。除了代码和问题外，每个数据条目还包含任务描述、错误位置和调试指令。代码片段被包装在Python Markdown标题中，对话使用ChatML格式进行结构化。

支持的任务

该数据集支持多种任务：

代码调试：根据错误代码和问题预测正确的代码片段。
问答：根据任务描述和错误代码生成问题。
代码生成：根据任务描述、问题和调试指令生成正确的代码。
CST生成：根据代码生成具体语法树。

语言

数据集中的文本为英语，代码为Python。

数据集结构

数据实例

数据集的核心围绕以下概念构建，具体名称请参考数据框的列标题：

correct_code：原始的正确Python代码片段。
task_description：编码任务的简要描述。
bugged_code：正确代码的错误版本。
bugs_location：代码中错误的位置。
debugging_instructions：帮助调试代码的指令。
question：与错误代码相关的问题，模仿初学者的查询。
answer：问题的答案，始终包含原始正确的代码或与原始代码匹配的GPT生成的代码片段（包括代码格式和注释）。

数据分割

数据集被分割成五个文件：

bugged_leetcode_all_conversations_with_embeddings.parquet
bugged_leetcode_all_conversations.parquet
bugged_leetcode_no_replaced_with_embeddings.parquet
bugged_leetcode_no_replaced.parquet
bugged_leetcode_all_steps.parquet

数据生成过程

数据是通过结合LeetCode问题数据、OpenBugger包和GPT 3.5模型生成的。原始代码使用OpenBugger进行错误注入，然后使用GPT模型根据错误代码和任务描述生成问题和答案，以限制GPT对自然语言的贡献，而不是代码方面。额外的处理确保最终答案是可编译的Python代码，并与原始LeetCode解决方案相对应。

数据集创建

策划理由

该数据集是为了提供一个大规模、多样化的Python编程问题及其错误版本的集合，可用于开发和评估调试、代码生成和问答模型。

数据集来源

原始编程问题来自leetcode-solutions-python。

许可信息

请参考原始数据集的许可信息。

数据集使用

使用注意事项

用户应注意，该数据集中的问题包含一些典型短语，可能需要检查n-gram分布并过滤峰值。已经应用了多个后处理步骤，但安全总比遗憾好。

数据集维护

联系信息

如有任何关于该数据集的问题或关注，请联系原始作者。

数据集更新

这是一个静态数据集，不会接收更新。

5,000+

优质数据集

54 个

任务类型

进入经典数据集