Leon-Leee/Code-Feedback-decontamination

Name: Leon-Leee/Code-Feedback-decontamination
Creator: Leon-Leee
Published: 2024-03-28 11:18:56
License: 暂无描述

Hugging Face2024-03-28 更新2024-06-11 收录

下载链接：

https://hf-mirror.com/datasets/Leon-Leee/Code-Feedback-decontamination

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 task_categories: - text-generation - question-answering language: - en tags: - code - croissant size_categories: - 10K<n<100K --- A decontaminated version of [m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback). The excluded (28) files are "contaminated" with only two code segments: 1. simple GCD function: `while b: a, b = b, a % b return a` 2. sum_to_n solution: `return sum(range(n + 1))` And reformated to *sharegpt*. Decontamination is done in the same way as [Magicoder](https://github.com/ise-uiuc/magicoder/tree/main/src/magicoder/decontamination) (ie., bigcode decontamination process), which uses a substring-match-finding method to find overlaps between a target dataset and the following standard benchmarks: - HumanEval - MBPP - codeparrot/apps - gsm8k - ds-1000 One should notice that MultiPL-E is not included because it's from HumanEval and MBPP; "solutions" from apps are not included because the dataset is too large and it takes very much long time.

提供机构：

Leon-Leee

原始信息汇总

数据集概述

基本信息

许可证: Apache-2.0
任务类别:
- 文本生成
- 问答
语言: 英语
标签:
- 代码
- 可颂（croissant）
大小分类: 10K<n<100K

数据集描述

该数据集是m-a-p/Code-Feedback的净化版本。
排除了28个文件，这些文件包含以下两个代码段：
1. 简单GCD函数：while b: a, b = b, a % b return a
2. sum_to_n解决方案：return sum(range(n + 1))
数据集已重新格式化为sharegpt。

净化方法

净化过程类似于Magicoder，采用子字符串匹配方法，以查找目标数据集与以下标准基准之间的重叠：
- HumanEval
- MBPP
- codeparrot/apps
- gsm8k
- ds-1000
未包括MultiPL-E，因其源自HumanEval和MBPP；未包括来自apps的“解决方案”，因数据集过大，处理时间过长。

5,000+

优质数据集

54 个

任务类型

进入经典数据集