LangAGI-Lab/COFFEE-Dataset

Name: LangAGI-Lab/COFFEE-Dataset
Creator: LangAGI-Lab
Published: 2024-04-08 01:50:29
License: No description available

Source: Hugging Face
Updated: 2024-04-08
Indexed: 2025-04-12

Download link:

https://hf-mirror.com/datasets/LangAGI-Lab/COFFEE-Dataset

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: eval
    path: data/eval-*
dataset_info:
  features:
  - name: diff_score
    dtype: float64
  - name: feedback
    dtype: string
  - name: problem_id
    dtype: string
  - name: wrong_code
    dtype: string
  - name: correct_code
    dtype: string
  - name: input_format
    dtype: string
  - name: index
    dtype: int64
  - name: variable_overlap
    dtype: float64
  - name: description
    dtype: string
  - name: output_format
    dtype: string
  - name: user_id
    dtype: string
  - name: metadata
    struct:
    - name: 맞힌 사람
      dtype: string
    - name: 메모리 제한
      dtype: string
    - name: 시간 제한
      dtype: string
    - name: 정답
      dtype: string
    - name: 정답 비율
      dtype: string
    - name: 제출
      dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_bytes: 109928745
    num_examples: 40586
  - name: eval
    num_bytes: 11223340
    num_examples: 4196
  download_size: 38570356
  dataset_size: 121152085
---

# Dataset Card for "COFFEE-Dataset"

This is the official dataset for [COFFEE: Boost Your Code LLMs by Fixing Bugs with Feedback](https://arxiv.org/pdf/2311.07215.pdf).

The COFFEE dataset is built for training a critic that generates natural language feedback given erroneous code.

Overall filtered ratio: 12.65%

- Short Feedback: 0.00% (0 samples)
- stdin readline present: 1.37% (639 samples)
- Low Diff Score: 7.79% (3622 samples)
- Low Variable Overlap: 1.75% (813 samples)
- Variable Name: 1.74% (807 samples)

Number of problem IDs in the train and eval splits, respectively:

- train: 739
- eval: 578

![an example instance from COFFEE dataset](./coffee_example.svg)

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
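
The field list in the card suggests how one record could be turned into a training pair for a feedback-generating critic. The following is a minimal sketch under stated assumptions, not the authors' pipeline: `CoffeeRecord`, `build_critic_example`, `load_coffee`, `card_sizes_consistent`, and the prompt wording are hypothetical names and choices introduced here for illustration. Only the field names, the dataset ID, and the byte counts come from the card. `load_coffee` assumes the Hugging Face `datasets` library is installed and that the dataset is reachable over the network.

```python
from typing import Dict, TypedDict


class CoffeeRecord(TypedDict):
    """Subset of the features listed in the card (field names from the card)."""
    problem_id: str
    description: str
    input_format: str
    output_format: str
    wrong_code: str
    correct_code: str
    feedback: str


def build_critic_example(record: CoffeeRecord) -> Dict[str, str]:
    """Hypothetical helper: format one record as a (prompt, target) pair.

    The prompt layout is an assumption; the card does not say how the
    authors present inputs to the critic.
    """
    prompt = (
        f"Problem description:\n{record['description']}\n\n"
        f"Input format:\n{record['input_format']}\n\n"
        f"Output format:\n{record['output_format']}\n\n"
        f"Incorrect code:\n{record['wrong_code']}\n\n"
        "Feedback:"
    )
    return {"prompt": prompt, "target": record["feedback"]}


def load_coffee(split: str = "train"):
    """Load one split ("train" or "eval") via the `datasets` library.

    The import is local so the rest of this sketch runs without the library.
    """
    from datasets import load_dataset

    return load_dataset("LangAGI-Lab/COFFEE-Dataset", split=split)


# Sizes as stated in the card's `dataset_info` block.
CARD_SPLITS = {
    "train": {"num_bytes": 109928745, "num_examples": 40586},
    "eval": {"num_bytes": 11223340, "num_examples": 4196},
}
CARD_DATASET_SIZE = 121152085


def card_sizes_consistent() -> bool:
    """Check that the per-split byte counts add up to `dataset_size`."""
    total = sum(split["num_bytes"] for split in CARD_SPLITS.values())
    return total == CARD_DATASET_SIZE
```

With the library installed, something like `build_critic_example(load_coffee("eval")[0])` would yield one prompt/target pair. The `metadata` struct is left out of `CoffeeRecord` because its keys are Korean problem statistics (roughly: solvers, memory limit, time limit, correct answers, correct-answer ratio, submissions) that a critic prompt would not normally need.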

Provider:

LangAGI-Lab
