Leon-Leee/Code-Feedback-decontamination
收藏Hugging Face2024-03-28 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/Leon-Leee/Code-Feedback-decontamination
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- code
- croissant
size_categories:
- 10K<n<100K
---
A decontaminated version of [m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback).
The excluded (28) files are "contaminated" with only two code segments:
1. simple GCD function: `while b: a, b = b, a % b return a`
2. sum_to_n solution: `return sum(range(n + 1))`
And reformated to *sharegpt*.
Decontamination is done in the same way as [Magicoder](https://github.com/ise-uiuc/magicoder/tree/main/src/magicoder/decontamination) (ie., bigcode decontamination process), which uses a substring-match-finding method to find overlaps between a target dataset and the following standard benchmarks:
- HumanEval
- MBPP
- codeparrot/apps
- gsm8k
- ds-1000
One should notice that MultiPL-E is not included because it's from HumanEval and MBPP; "solutions" from apps are not included because the dataset is too large and it takes very much long time.
提供机构:
Leon-Leee
原始信息汇总
数据集概述
基本信息
- 许可证: Apache-2.0
- 任务类别:
- 文本生成
- 问答
- 语言: 英语
- 标签:
- 代码
- 可颂(croissant)
- 大小分类: 10K<n<100K
数据集描述
- 该数据集是m-a-p/Code-Feedback的净化版本。
- 排除了28个文件,这些文件包含以下两个代码段:
- 简单GCD函数:
while b: a, b = b, a % b return a - sum_to_n解决方案:
return sum(range(n + 1))
- 简单GCD函数:
- 数据集已重新格式化为sharegpt。
净化方法
- 净化过程类似于Magicoder,采用子字符串匹配方法,以查找目标数据集与以下标准基准之间的重叠:
- HumanEval
- MBPP
- codeparrot/apps
- gsm8k
- ds-1000
- 未包括MultiPL-E,因其源自HumanEval和MBPP;未包括来自apps的“解决方案”,因数据集过大,处理时间过长。



