opencompass/CriticBench

Name: opencompass/CriticBench
Creator: opencompass
Published: 2024-02-23 11:10:23
License: no description provided

Source: Hugging Face (updated 2024-02-23, indexed 2024-03-04)

Download link:

https://hf-mirror.com/datasets/opencompass/CriticBench

Description:

# CriticBench: Evaluating Large Language Model as Critic

This repository is the official implementation of [CriticBench](https://arxiv.org/abs/2402.13764), a comprehensive benchmark for evaluating the critique ability of LLMs.

## Introduction

**[CriticBench: Evaluating Large Language Model as Critic](https://arxiv.org/abs/2402.13764)**

Tian Lan<sup>1*</sup>, Wenwei Zhang<sup>2*</sup>, Chen Xu<sup>1</sup>, Heyan Huang<sup>1</sup>, Dahua Lin<sup>2</sup>, Kai Chen<sup>2†</sup>, Xian-ling Mao<sup>1†</sup>

(† Corresponding Author, * Equal Contribution)

<sup>1</sup> Beijing Institute of Technology, <sup>2</sup> Shanghai AI Laboratory

[![arXiv](https://img.shields.io/badge/arXiv-2307.04725-b31b1b.svg)](https://arxiv.org/abs/2402.13764)
[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)

[[Dataset on HF](https://huggingface.co/datasets/opencompass/CriticBench)]
[[Project Page](https://open-compass.github.io/CriticBench/)]
[[Subjective LeaderBoard](https://open-compass.github.io/CriticBench/leaderboard_subjective.html)]
[[Objective LeaderBoard](https://open-compass.github.io/CriticBench/leaderboard_objective.html)]

> Critique ability is crucial in the scalable oversight and self-improvement of Large Language Models (LLMs). While many recent studies explore the critique ability of LLMs to judge and refine flaws in generations, how to comprehensively and reliably measure the critique abilities of LLMs is under-explored. This paper introduces CriticBench, a novel benchmark designed to comprehensively and reliably evaluate four key critique ability dimensions of LLMs: feedback, comparison, refinement and meta-feedback. CriticBench encompasses nine diverse tasks, each assessing the LLMs' ability to critique responses at varying levels of quality granularity. Our extensive evaluations of open-source and closed-source LLMs reveal intriguing relationships between the critique ability and tasks, response qualities, and model scales.


<img src="./figs/overview.png" alt="overview" align=center />

## What's New

* **[2024.2.21]** Release paper, codes, data and other resources of CriticBench v1.3.

## Next

- [ ] Evaluate Qwen-1.5 series models
- [ ] Improve the reliability of subjective evaluation in CriticBench (v1.4)
- [ ] Expand to more diverse tasks
- [ ] Expand to Chinese applications
- [ ] Prepare and clean the codebase for OpenCompass
- [ ] Release the train set of CriticBench
- [ ] Support inference on OpenCompass

## Quick Start

### 1. Prepare

#### 1.1 Prepare Dataset

Download the dataset from the [Hugging Face dataset page](https://huggingface.co/datasets/opencompass/CriticBench) by running these commands:

```bash
mkdir data
cd data
git clone https://huggingface.co/datasets/opencompass/CriticBench
```

These commands create and enter the `data` folder, then clone the CriticBench dataset into it.
Note that the human-annotated Likert scores, preference labels, and critiques in the `test` set are excluded.
You can submit your inference results on the `test` set (generated by running the code under the `inference` folder) to [this email address](mailto:lantiangmftby@gmail.com). We will evaluate your predictions and update the results in our leaderboard. Please also provide the scale of the model you tested. The structure of your submission should be similar to that in `example_data`.

#### 1.2 Prepare Code and Env

```bash
git clone https://github.com/open-compass/CriticBench.git

# prepare the env for the evaluation toolkit
# (the toolkit lives inside the cloned repository)
cd CriticBench/critic_bench
pip install -r requirements.txt

# prepare the env for LLM inference
cd ../inference
pip install -r requirements.txt
```

### 2. Run LLM Inference on CriticBench

First run inference with the LLMs you want to evaluate on CriticBench; the generated results are saved in the `inference/outputs` folder.
If you are interested in the prompts we use for the LLMs, they are shown in [inference/utils/prompts.py](inference/utils/prompts.py).

Specifically, the inference code looks like this:

```python
import json
import os

from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

# `load_all_datasets` comes from `inference/utils`; it loads all the evaluation datasets in CriticBench
datasets = load_all_datasets(args['data_dir'])

# these lines init the tokenizer and model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(
    args['model_name'], trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    args['model_name'], device_map="auto", trust_remote_code=True
).cuda().eval()

...

# run inference with the LLM and save the results in JSON format
for abbr, dataset in tqdm(datasets.items()):
    path = os.path.join(folder_path, abbr + ".json")
    results = {}
    for item in tqdm(dataset['dev']):
        # If you want to run inference with another LLM, revise this line
        response, history = model.chat(tokenizer, item['question'], history=[])
        results[str(len(results))] = {
            'origin_prompt': item['question'],
            'prediction': response
        }
    # save the results into a JSON file, with the abbr as the file name
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(results, f, ensure_ascii=False, indent=4)
```

We only provide the inference codebase for our [InternLM2-7B-Chat](https://huggingface.co/internlm/internlm2-chat-7b), but it is easy to revise the inference code to evaluate your own LLMs (more details are in [inference/internlm2.py](./inference/internlm2.py)).

#### Example Inference Data of Representative LLMs

We have already released the generation results of some representative LLMs on CriticBench; you can find them in [example_data/prediction_v1.3.tgz](example_data/prediction_v1.3.tgz).

```bash
tar -xzvf example_data/prediction_v1.3.tgz
```

After extracting the archive, you can inspect the predictions of these LLMs on CriticBench.
The evaluation files are typically named `{split}_{domain}_{dimension}_{format}.json`, where `split`, `dimension`, and `format` are the arguments described under Compute Scores.

The `domain` is one of the 9 task scenarios in CriticBench: `translate`, `chat`, `qa`, `harmlessness`, `summary`, `math_cot`, `math_pot`, `code_exec`, and `code_not_exec`. Refer to our paper for more details.

Here are some notes:

* The `comp_feedback` critique dimension is always accompanied by a `reverse` file, which is used to address the well-known positional bias problem of the LLM-as-a-judge procedure. Refer to Section 4 of our paper for more details.
* For the `feedback` critique dimension, each `domain` has additional `*_correction_part.json` files, which store the evaluation results of critiques for correct or very high-quality responses. Refer to our paper for more details about these responses.

The format of the evaluation result file is:

```python
{
    '0': {
        'origin_prompt': 'The original prompt for LLMs to be evaluated',
        'prediction': 'The generated critiques to be evaluated'
    }
}
```

### 3. Compute the Evaluation Results on CriticBench

After getting the generation results under `inference/outputs`, the next step is to compute the objective and subjective scores in CriticBench using our toolkit.
See Section 4 of our paper for more details about the objective and subjective scores.

The `critic_bench` folder supports computing both `objective` and `subjective` scores.

* Objective scores can be computed automatically without any cost
* Subjective scores rely on the advanced GPT-4-turbo model for automatic evaluation

#### Compute Scores

Computing the scores takes only a few commands.
Before running them, please make sure that your own OpenAI API key is set in [critic_bench/run.sh](critic_bench/run.sh).

```bash
export OPENAI_API_KEY=...
```

Then run the following command for evaluation:

```bash
./run.sh <dimension> <format> <split> <save_dir>
```

* `dimension` denotes the critique dimension defined in CriticBench, one of `feedback`, `correction`, `comp_feedback`, and `meta_feedback`.
  Refer to Section 2 of our paper for more details about these critique dimensions.
* `format` denotes the evaluation format, `objective` or `subjective`. Objective scores (Spearman correlation, pass rate, and preference accuracy) can be computed automatically without any cost, while subjective scores are computed by prompting GPT-4-turbo to compare the generated critiques with our human-annotated high-quality critiques in CriticBench.
* `split` denotes whether the `test` or `dev` set is evaluated.
* `save_dir` is the path where the evaluation results are saved.

In the [run.sh](critic_bench/run.sh) file, you can find the corresponding commands for the objective and subjective evaluation processes.
For example, for the feedback critique dimension, the objective evaluation command looks like this:

```bash
python run_feedback.py --root_dir "../data/CriticBench" --prediction_dir "../example_data/prediction_v1.3" --split $3 --obj True
```

* `root_dir` is the path containing the `test` and `dev` sets of CriticBench.
* `prediction_dir` contains the inference results of the LLMs to be evaluated. We also provide the inference results of some representative LLMs in `example_data`. If you want to evaluate your own LLMs, please refer to `inference/README.md` for more details; in that case `prediction_dir` can be set to `../inference/outputs`.
* `split` denotes whether the `test` or the `dev` set is used.
* `obj` denotes whether the objective evaluation is activated.

For the subjective evaluation of the feedback critique dimension, the evaluation command looks like this:

```bash
python run_feedback.py --root_dir "../data/CriticBench" --prediction_dir "../example_data/prediction_v1.3" --evaluation_dir "../example_data/evaluation_v1.3/" --batch_size 1 --split $3 --obj False
```

* `evaluation_dir` saves the subjective evaluation scores of GPT-4, which can be reloaded if the subjective evaluation process is interrupted.
  The order of the samples in each file in `evaluation_dir` follows the order of the original data in CriticBench (`data/CriticBench`).
* `batch_size` controls the number of processes used to access the GPT-4 API under the multiprocessing setting.

The evaluation results of GPT-4 under `save_dir` are stored in `jsonl` format, with one evaluation result per line.
In each line, the `evaluation` key holds a `dict` consisting of GPT-4's chain-of-thought rationale (key `cot`) and a Likert score (key `score`) for each critique, ranging from 1 to 10.

* 1 denotes the worst performance
* 10 denotes the best performance
* 8 denotes performance comparable to our human-annotated high-quality critiques, and scores higher than 8 mean the evaluated critique is better than the human-annotated one.

## Benchmark Results

The subjective evaluation results of some representative LLMs are shown below:

<img src="./figs/subjective_score.png" alt="subjective" align=center />

The objective evaluation results of some representative LLMs are shown below:

<img src="./figs/objective_score.png" alt="objective" align=center />

Refer to our [Project Page](https://open-compass.github.io/CriticBench/) for the complete evaluation results on CriticBench.

## Acknowledgements

CriticBench is built with [OpenCompass](https://github.com/open-compass/opencompass). Thanks for their awesome work!

The quota for API-based LLMs is supported by Beijing Institute of Technology and Shanghai AI Laboratory. Thank you so much!

## Contact Us

* **Tian Lan**: lantiangmftby@gmail.com
* **Wenwei Zhang**: zhangwenwei@pjlab.org.cn

## BibTeX

```
@misc{lan2024criticbench,
      title={CriticBench: Evaluating Large Language Models as Critic},
      author={Tian Lan and Wenwei Zhang and Chen Xu and Heyan Huang and Dahua Lin and Kai Chen and Xian-ling Mao},
      year={2024},
      eprint={2402.13764},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

This project is released under the Apache 2.0 [license](./LICENSE).

Provider: opencompass

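Summary of original information

CriticBench: Evaluating Large Language Model as Critic

Introduction

CriticBench is a comprehensive benchmark for evaluating the critique ability of large language models (LLMs). It comprises nine diverse tasks and assesses LLMs along four key critique dimensions: feedback, comparison, refinement, and meta-feedback.

Dataset download

The dataset can be downloaded from Hugging Face:

    mkdir data
    cd data
    git clone https://huggingface.co/datasets/opencompass/CriticBench

Note: the human-annotated Likert scores, preference labels, and critiques of the test set are not included.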

Dataset structure

The dataset covers the following task scenarios:

translate
chat
qa
harmlessness
summary
math_cot
math_pot
code_exec
code_not_exec

Evaluation files are named {split}_{domain}_{dimension}_{format}.json; see the paper for detailed descriptions of split, dimension, and format.
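
Several domain and dimension names contain underscores themselves (`math_cot`, `code_not_exec`, `comp_feedback`), so splitting a file name on `_` is ambiguous. The sketch below matches each field against the vocabularies listed in the README instead. It is only an illustration: `parse_eval_filename` and `EvalFile` are hypothetical names that are not part of the CriticBench toolkit, and the pattern deliberately covers only the basic naming scheme, not the `reverse` or `*_correction_part.json` variants.

```python
import re
from typing import NamedTuple

# Vocabularies taken from the README; anything else is rejected.
SPLITS = ("test", "dev")
DOMAINS = ("translate", "chat", "qa", "harmlessness", "summary",
           "math_cot", "math_pot", "code_exec", "code_not_exec")
DIMENSIONS = ("feedback", "correction", "comp_feedback", "meta_feedback")
FORMATS = ("objective", "subjective")


class EvalFile(NamedTuple):
    split: str
    domain: str
    dimension: str
    fmt: str


def _alternation(options):
    # Try longer names first, e.g. "comp_feedback" before "feedback".
    ordered = sorted(options, key=len, reverse=True)
    return "|".join(re.escape(option) for option in ordered)


_PATTERN = re.compile(
    rf"^({_alternation(SPLITS)})_({_alternation(DOMAINS)})"
    rf"_({_alternation(DIMENSIONS)})_({_alternation(FORMATS)})\.json$"
)


def parse_eval_filename(name: str) -> EvalFile:
    """Split a basic `{split}_{domain}_{dimension}_{format}.json` name into its fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a basic CriticBench evaluation file name: {name!r}")
    return EvalFile(*match.groups())
```

For instance, `parse_eval_filename("dev_math_cot_comp_feedback_objective.json")` yields the fields `dev`, `math_cot`, `comp_feedback`, and `objective`.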

Evaluation result file format

    {
        '0': {
            'origin_prompt': 'The original prompt for LLMs to be evaluated',
            'prediction': 'The generated critiques to be evaluated'
        }
    }
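
Before evaluating or submitting a prediction file, it can help to check that it follows this structure. The following is a minimal sketch, not part of the toolkit: `validate_predictions` and `load_predictions` are hypothetical helpers, and the check that keys run `"0"`, `"1"`, ... in order is an assumption based on how the README's inference loop numbers its results.

```python
import json

REQUIRED_FIELDS = ("origin_prompt", "prediction")


def validate_predictions(results: dict) -> list:
    """Return a list of problems found; an empty list means the structure looks fine."""
    problems = []
    for index, (key, entry) in enumerate(results.items()):
        # Assumption: keys are consecutive string indices starting at "0".
        if key != str(index):
            problems.append(f"key {key!r} should be {str(index)!r}")
        if not isinstance(entry, dict):
            problems.append(f"entry {key!r} is not an object")
            continue
        for field in REQUIRED_FIELDS:
            if not isinstance(entry.get(field), str):
                problems.append(f"entry {key!r} lacks a string {field!r}")
    return problems


def load_predictions(path) -> dict:
    """Load a prediction JSON file and raise ValueError if its structure is off."""
    with open(path, encoding="utf-8") as f:
        results = json.load(f)
    problems = validate_predictions(results)
    if problems:
        raise ValueError("; ".join(problems))
    return results
```

Returning the full list of problems, rather than stopping at the first, makes it easier to repair a file in one pass.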

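Evaluation methods

Two types of scores are provided:

Objective scores: computed automatically, at no cost.
Subjective scores: rely on the GPT-4-turbo model for automatic evaluation.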
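
The README names Spearman correlation as one of the objective metrics. To make that metric concrete, here is a dependency-free sketch that computes it for two equally long lists of scores, giving tied values their average rank. This is an illustration of the statistic itself, not the toolkit's implementation, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the rank-transformed data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least two items")
    return pearson(average_ranks(xs), average_ranks(ys))
```

Because only ranks matter, any monotonically increasing relationship between the two score lists gives a correlation of 1, regardless of the scale each scorer uses.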

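Score computation command

    ./run.sh <dimension> <format> <split> <save_dir>

dimension: the critique dimension, one of feedback, correction, comp_feedback, and meta_feedback.
format: the evaluation format, objective or subjective.
split: the evaluation set, test or dev.
save_dir: the path where the evaluation results are saved.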
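
Since `run.sh` takes four positional arguments, a mistyped value is easy to miss until the script fails. The sketch below validates the arguments against the values listed above and builds the argument list; `build_run_command` is a hypothetical wrapper, not something shipped with CriticBench, and it assumes the script is invoked as `./run.sh` from the `critic_bench` folder.

```python
import shlex

DIMENSIONS = {"feedback", "correction", "comp_feedback", "meta_feedback"}
FORMATS = {"objective", "subjective"}
SPLITS = {"test", "dev"}


def build_run_command(dimension, fmt, split, save_dir):
    """Validate the four positional arguments and return the argv list for run.sh."""
    checks = (
        ("dimension", dimension, DIMENSIONS),
        ("format", fmt, FORMATS),
        ("split", split, SPLITS),
    )
    for label, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{label} must be one of {sorted(allowed)}, got {value!r}")
    if not save_dir:
        raise ValueError("save_dir must be a non-empty path")
    return ["./run.sh", dimension, fmt, split, str(save_dir)]


if __name__ == "__main__":
    # "results/feedback_obj" is a placeholder output path.
    argv = build_run_command("feedback", "objective", "dev", "results/feedback_obj")
    print(shlex.join(argv))
```

The returned list can be handed to `subprocess.run(argv, cwd="critic_bench", check=True)` once the OpenAI API key is configured for subjective runs.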

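Benchmark results

The subjective and objective evaluation results are shown in subjective_score.png and objective_score.png, respectively.

Acknowledgements

CriticBench is built with OpenCompass.

Contact

Tian Lan: lantiangmftby@gmail.com
Wenwei Zhang: zhangwenwei@pjlab.org.cn

License

This project is released under the Apache 2.0 license.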

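Collected summary

Dataset introduction

Background and challenges

Background overview

CriticBench is a comprehensive benchmark for evaluating the critique ability of large language models, covering four key dimensions and nine diverse tasks. The dataset was developed by Beijing Institute of Technology and Shanghai AI Laboratory and provides detailed evaluation tools and example data, supporting users in running model inference and performance evaluation.

The content above was collected and summarized by 遇见数据集.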
