five

Leon-Leee/Code-Feedback-decontamination

收藏
Hugging Face2024-03-28 更新2024-06-11 收录
下载链接:
https://hf-mirror.com/datasets/Leon-Leee/Code-Feedback-decontamination
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 task_categories: - text-generation - question-answering language: - en tags: - code - croissant size_categories: - 10K<n<100K --- A decontaminated version of [m-a-p/Code-Feedback](https://huggingface.co/datasets/m-a-p/Code-Feedback). The excluded (28) files are "contaminated" with only two code segments: 1. simple GCD function: `while b: a, b = b, a % b return a` 2. sum_to_n solution: `return sum(range(n + 1))` And reformated to *sharegpt*. Decontamination is done in the same way as [Magicoder](https://github.com/ise-uiuc/magicoder/tree/main/src/magicoder/decontamination) (ie., bigcode decontamination process), which uses a substring-match-finding method to find overlaps between a target dataset and the following standard benchmarks: - HumanEval - MBPP - codeparrot/apps - gsm8k - ds-1000 One should notice that MultiPL-E is not included because it's from HumanEval and MBPP; "solutions" from apps are not included because the dataset is too large and it takes very much long time.
提供机构:
Leon-Leee
原始信息汇总

数据集概述

基本信息

  • 许可证: Apache-2.0
  • 任务类别:
    • 文本生成
    • 问答
  • 语言: 英语
  • 标签:
    • 代码
    • 可颂(croissant)
  • 大小分类: 10K<n<100K

数据集描述

  • 该数据集是m-a-p/Code-Feedback的净化版本。
  • 排除了28个文件,这些文件包含以下两个代码段:
    1. 简单GCD函数:while b: a, b = b, a % b return a
    2. sum_to_n解决方案:return sum(range(n + 1))
  • 数据集已重新格式化为sharegpt

净化方法

  • 净化过程类似于Magicoder,采用子字符串匹配方法,以查找目标数据集与以下标准基准之间的重叠:
    • HumanEval
    • MBPP
    • codeparrot/apps
    • gsm8k
    • ds-1000
  • 未包括MultiPL-E,因其源自HumanEval和MBPP;未包括来自apps的“解决方案”,因数据集过大,处理时间过长。
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作