prometheus-eval/Feedback-Collection

Name: prometheus-eval/Feedback-Collection
Creator: prometheus-eval
Published: 2023-10-14 14:53:22
License: cc-by-4.0

Source: Hugging Face. Updated 2023-10-14; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/prometheus-eval/Feedback-Collection

Resource description:

---
license: cc-by-4.0
task_categories:
- text-generation
- text-classification
language:
- en
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: "new_feedback_collection.json"
---

## Dataset Description

- **Homepage:** https://github.com/kaistAI/Prometheus
- **Repository:** https://github.com/kaistAI/Prometheus
- **Paper:** https://arxiv.org/abs/2310.08491
- **Point of Contact:** seungone@kaist.ac.kr

# Dataset Card

### Dataset Summary

The Feedback Collection is a dataset designed to induce fine-grained evaluation capabilities into language models.

![plot](./feedback_collection.JPG)

Recently, proprietary LLMs (e.g., GPT-4) have been used to evaluate long-form responses. In our experiments, we found that open-source LMs are not capable of evaluating long-form responses, showing low correlation with both human evaluators and GPT-4.

In our paper, we found that by (1) fine-tuning on feedback generated by GPT-4 and (2) including the appropriate reference materials (reference answers & score rubrics), we can effectively induce fine-grained evaluation into open-source LMs.

The Feedback Collection provides 1K score rubrics, 20K instructions & reference answers, and 100K responses & feedback (20K for each score in the range 1-5).

Experimental results show that Prometheus (an LM obtained by fine-tuning Llama-2-Chat on the Feedback Collection) can function as an evaluator in both an absolute scoring setting and a ranking scoring setting.

### Languages

English

## Dataset Structure

* `instruction`: The input that is given to the evaluator LM. It includes the instruction & response to evaluate, the reference answer, and the score rubric.
* `output`: The output that the evaluator LM should generate. It includes the feedback and the score decision, separated by the phrase `[RESULT]`.
* `orig_instruction`: The instruction to be evaluated. Note that this differs from `instruction`, which includes all the components.
* `orig_response`: The response to be evaluated.
* `orig_reference_answer`: A reference answer to the `orig_instruction`.
* `orig_criteria`: The score criteria used to evaluate the `orig_response`.
* `orig_score1_description`: A description of when to give a score of 1 to the `orig_response`.
* `orig_score2_description`: A description of when to give a score of 2 to the `orig_response`.
* `orig_score3_description`: A description of when to give a score of 3 to the `orig_response`.
* `orig_score4_description`: A description of when to give a score of 4 to the `orig_response`.
* `orig_score5_description`: A description of when to give a score of 5 to the `orig_response`.
* `orig_feedback`: Feedback that critiques the `orig_response`.
* `orig_score`: An integer between 1 and 5 given to the `orig_response`.

In our paper, we formatted the training input with the following prompt (already applied in the `instruction` field):

```
###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:
```

The following format (already applied in the `output` field) was used as the training target for the evaluator LM:

```
{orig_feedback}
[RESULT] {orig_score}
```

Then, during evaluation, we parsed the prediction after the phrase `[RESULT]`.

### Data Splits

| name                | train  |
|---------------------|-------:|
| Feedback-Collection | 99,952 |

### Citation Information

If you find this work helpful, please consider citing our paper!

```bibtex
@misc{kim2023prometheus,
  title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models},
  author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo},
  year={2023},
  eprint={2310.08491},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
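The card states that predictions are parsed after the phrase `[RESULT]` but does not show how. The following is a minimal sketch of one way to do that, not the authors' parser. The function name `parse_evaluator_output` is hypothetical, and splitting on the last occurrence of the marker and rejecting anything that is not a standalone digit from 1 to 5 are assumptions made for this example.

```python
import re
from typing import Optional, Tuple

RESULT_MARKER = "[RESULT]"


def parse_evaluator_output(text: str) -> Tuple[str, Optional[int]]:
    """Split an evaluator output into (feedback, score).

    Assumption: the score is the first standalone digit 1-5 that follows
    the LAST occurrence of the marker. Returns score=None when the marker
    or a valid score is missing.
    """
    feedback, sep, tail = text.rpartition(RESULT_MARKER)
    if not sep:
        # Marker not found: treat the whole text as feedback.
        return text.strip(), None
    match = re.search(r"\b[1-5]\b", tail)
    score = int(match.group()) if match else None
    return feedback.strip(), score


if __name__ == "__main__":
    print(parse_evaluator_output("The response is accurate and complete. [RESULT] 5"))
```

Returning `None` instead of raising keeps malformed generations countable, which is useful when reporting how often an evaluator fails to follow the output format.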

Provided by:

prometheus-eval

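Summary of original information

Dataset description

Dataset overview

The Feedback Collection is a dataset designed to improve the fine-grained evaluation capability of language models. By (1) fine-tuning on feedback generated by GPT-4 and (2) including appropriate reference materials (reference answers and score rubrics), fine-grained evaluation can be effectively induced in open-source language models.

The dataset contains 1K score rubrics, 20K instructions and reference answers, and 100K responses and feedback (20K for each score from 1 to 5). Experimental results show that Prometheus, a language model obtained by fine-tuning Llama-2-Chat on the Feedback Collection, can serve as an evaluator in both absolute scoring and ranking scoring settings.

Languages

English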

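Dataset structure

instruction: The input given to the evaluator LM, including the instruction and response to evaluate, the reference answer, and the score rubric.
output: The output the evaluator LM should generate, including the feedback and the score decision, separated by the phrase [RESULT].
orig_instruction: The instruction to be evaluated. Note that this differs from instruction, which includes all the components.
orig_response: The response to be evaluated.
orig_reference_answer: A reference answer to orig_instruction.
orig_criteria: The score criteria used to evaluate orig_response.
orig_score1_description: A description of when to give a score of 1.
orig_score2_description: A description of when to give a score of 2.
orig_score3_description: A description of when to give a score of 3.
orig_score4_description: A description of when to give a score of 4.
orig_score5_description: A description of when to give a score of 5.
orig_feedback: Feedback on orig_response.
orig_score: An integer score between 1 and 5 given to orig_response.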
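To illustrate how the `orig_*` fields relate to the processed `instruction` and `output` fields, here is a hedged sketch that fills the card's prompt template from a single record. The helper names `build_prompt` and `build_target` are hypothetical. The template wording follows the card, but the exact line breaks, spacing, and quote characters are assumptions, so the dataset's own `instruction` and `output` fields remain the authoritative versions.

```python
from typing import Mapping

# Wording taken from the dataset card; whitespace layout is an assumption.
PROMPT_TEMPLATE = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback: """

TARGET_TEMPLATE = "{orig_feedback} [RESULT] {orig_score}"


def build_prompt(record: Mapping[str, object]) -> str:
    """Fill the evaluator prompt from a record's orig_* fields."""
    # format_map raises KeyError if a required field is missing.
    return PROMPT_TEMPLATE.format_map(record)


def build_target(record: Mapping[str, object]) -> str:
    """Build the training target: feedback, the marker, then the score."""
    return TARGET_TEMPLATE.format_map(record)
```

Rebuilding the prompt this way is mainly useful for evaluating new responses that are not in the dataset; for training on the Feedback Collection itself, the pre-built fields can be used directly.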

Data splits

Name	Train
Feedback-Collection	99,952
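The summary describes roughly 20K examples per score, and the split table lists 99,952 training rows. A small sketch for checking that balance is shown below. `score_distribution` and `load_feedback_collection` are hypothetical helper names. The loader assumes the Hugging Face `datasets` package is installed and that the repository ID matches the page title, and it is not called at import time because it needs network access. Casting `orig_score` with `int()` is a precaution, since the stored type is not stated in the card.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def score_distribution(records: Iterable[Mapping[str, object]]) -> Dict[int, int]:
    """Count how many records carry each score from 1 to 5."""
    counts = Counter(int(r["orig_score"]) for r in records)
    return {score: counts.get(score, 0) for score in range(1, 6)}


def load_feedback_collection():
    """Load the train split (assumes `pip install datasets` and network access)."""
    from datasets import load_dataset

    return load_dataset("prometheus-eval/Feedback-Collection", split="train")


if __name__ == "__main__":
    train = load_feedback_collection()
    print(len(train))
    print(score_distribution(train))
```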

Citation information

If this model is helpful to you, please consider citing our paper:

@misc{kim2023prometheus,
  title={Prometheus: Inducing Fine-grained Evaluation Capability in Language Models},
  author={Seungone Kim and Jamin Shin and Yejin Cho and Joel Jang and Shayne Longpre and Hwaran Lee and Sangdoo Yun and Seongjin Shin and Sungdong Kim and James Thorne and Minjoon Seo},
  year={2023},
  eprint={2310.08491},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
