bay-calibration-llm-evaluators/llmbar-annotated-latest

Name: bay-calibration-llm-evaluators/llmbar-annotated-latest
Creator: bay-calibration-llm-evaluators
Published: 2024-11-18 07:28:47
License: No description available

Source: Hugging Face. Updated 2024-11-18. Indexed 2025-04-26.

Download link:

https://hf-mirror.com/datasets/bay-calibration-llm-evaluators/llmbar-annotated-latest

Description:

---
dataset_info:
  features:
  - name: task
    dtype: string
  - name: worker
    dtype: string
  - name: human_label
    dtype: int64
  - name: llm_label
    dtype: int64
  - name: generator_1
    dtype: string
  - name: generator_2
    dtype: string
  - name: instruction
    dtype: string
  - name: output_1
    dtype: string
  - name: output_2
    dtype: string
  - name: sub_dataset
    dtype: string
  - name: swap_equal
    dtype: bool
  splits:
  - name: train
    num_bytes: 8262754
    num_examples: 6285
  download_size: 386972
  dataset_size: 8262754
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# LLMBar-Select Dataset

## Introduction

The **LLMBar-Select** dataset is a curated subset of the original **LLMBar** dataset introduced by [Zeng et al. (2024)](https://arxiv.org/abs/2310.07641). The LLMBar dataset consists of 419 instances, each containing an instruction paired with two outputs: one that faithfully follows the instruction and another that deviates while presenting superficially appealing qualities. It is designed to evaluate LLM-based evaluators more rigorously and objectively than previous benchmarks.

The original dataset has two primary subsets:

1. **Natural Set**: Instances derived from existing human-preference datasets, filtered and modified for objective preferences.
2. **Adversarial Set**: Instances with outputs crafted to mislead evaluators by emphasizing superficial attributes.

In the LLMBar dataset, instances are evaluated using five different large language models (GPT-4-0613, GPT-3.5-turbo-0613, GPT-3.5-turbo-0301, Llama-2-70B-Chat, and PaLM 2) under various prompting strategies (e.g., Vanilla, Chain of Thought (CoT), Swap+CoT, etc.). A unique combination of an LLM and a prompting strategy is referred to as an **evaluator mode**.

The **LLMBar-Select** dataset focuses on evaluator modes with high evaluation accuracy. Specifically:

- Only evaluator modes with an overall evaluation accuracy above **60%** are included.
- Selected modes are based on **GPT-4** and **PaLM 2**:
  - **GPT-4**: Eight evaluator modes with evaluation accuracy exceeding 70%.
  - **PaLM 2**: Seven evaluator modes.

With 419 comparison tasks in the original LLMBar dataset and 15 evaluator modes (8 from GPT-4 and 7 from PaLM 2), the LLMBar-Select dataset contains **6,285 rows** (419 × 15).

This dataset accompanies the paper [**Gao et al. (2024). _Bayesian Calibration of Win Rate Estimation with LLM Evaluators_**](https://arxiv.org/abs/2411.04424). If you use this dataset, please cite both the **Gao et al. (2024)** and **Zeng et al. (2024)** papers. The original LLMBar dataset is available on [GitHub](https://github.com/princeton-nlp/LLMBar).

## Dataset Details

### Columns

- **task**: Unique identifier for each comparison task in the format `t_{task ID}`. Tasks with identical instructions and outputs share the same task ID (starting from 0).
- **worker**: Denotes the evaluator mode, formatted as `w_{model name}@{prompting strategy}`.
- **human_label**: Always `0`, indicating `output_1` is the objectively better response.
- **llm_label**:
  - `0`: LLM evaluator considers `output_1` better.
  - `1`: LLM evaluator considers `output_2` better.
- **generator_1**: Always `"correct"`, signifying `output_1` is the superior response.
- **generator_2**: Always `"incorrect"`, signifying `output_2` is the inferior response.
- **instruction**: Instruction or query intended to be answered by the outputs.
- **output_1**: First response to the instruction.
- **output_2**: Second response to the instruction.
- **sub_dataset**: Label indicating which subdataset the comparison belongs to. Refer to the LLMBar paper for details.
- **swap_equal**: Boolean field denoting whether swapping the order of the outputs results in the same judgment by the LLM evaluator. If `false`, the `llm_label` is randomly assigned.
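The column formats described above can be illustrated with a small parsing sketch. This is a minimal example under stated assumptions, not code from the dataset authors: the sample row, the values `"w_GPT-4@Vanilla"` and `"Natural"`, and the helper names `parse_worker`, `parse_task_id`, and `check_row` are all hypothetical. It also assumes that the task ID is an integer and that the model name itself contains no `@`.

```python
from typing import NamedTuple


class EvaluatorMode(NamedTuple):
    model: str
    strategy: str


def parse_worker(worker: str) -> EvaluatorMode:
    """Split 'w_{model name}@{prompting strategy}' into its two parts."""
    if not worker.startswith("w_") or "@" not in worker:
        raise ValueError(f"unexpected worker format: {worker!r}")
    # Assumption: the first '@' separates model name from prompting strategy.
    model, _, strategy = worker[2:].partition("@")
    return EvaluatorMode(model, strategy)


def parse_task_id(task: str) -> int:
    """Extract the numeric ID from 't_{task ID}' (assumed to be an integer)."""
    if not task.startswith("t_"):
        raise ValueError(f"unexpected task format: {task!r}")
    return int(task[2:])


def check_row(row: dict) -> None:
    """Assert the constant-valued columns match the column descriptions."""
    assert row["human_label"] == 0
    assert row["llm_label"] in (0, 1)
    assert row["generator_1"] == "correct"
    assert row["generator_2"] == "incorrect"


# Hypothetical row; the string values are illustrative, not taken from the data.
sample_row = {
    "task": "t_0",
    "worker": "w_GPT-4@Vanilla",
    "human_label": 0,
    "llm_label": 1,
    "generator_1": "correct",
    "generator_2": "incorrect",
    "sub_dataset": "Natural",
    "swap_equal": True,
}

check_row(sample_row)
mode = parse_worker(sample_row["worker"])
task_id = parse_task_id(sample_row["task"])

# Row count stated in the card: 419 tasks x (8 GPT-4 + 7 PaLM 2) evaluator modes.
expected_rows = 419 * (8 + 7)
```

Rows of this shape could presumably be obtained with the Hugging Face `datasets` library, for example `load_dataset("bay-calibration-llm-evaluators/llmbar-annotated-latest", split="train")`; the sketch avoids that dependency so it runs offline.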

## Use Cases

This dataset is useful for:

- **LLM Evaluation Analysis**: Measuring the ability of LLMs under different prompting strategies to evaluate instruction-following tasks.
- **LLM Bias Analysis**: Investigating model biases and evaluator tendencies.

## Citation

If you use the LLMBar-Select dataset, please cite:

- **Gao et al. (2024)**: [*Bayesian Calibration of Win Rate Estimation with LLM Evaluators*](https://arxiv.org/abs/2411.04424).
- **Zeng et al. (2024)**: [*Evaluating Large Language Models at Evaluating Instruction Following*](https://arxiv.org/abs/2310.07641).
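
As one possible starting point for the evaluation analysis use case, the sketch below computes per-evaluator-mode accuracy as the fraction of rows where `llm_label` agrees with `human_label`. The function name `accuracy_by_worker`, the `consistent_only` option, and the toy rows (including the worker names `w_A@Vanilla` and `w_B@CoT`) are hypothetical. Dropping rows with `swap_equal == False` is an illustrative choice motivated by the note that those labels are randomly assigned; it is not a procedure taken from either paper.

```python
from collections import defaultdict


def accuracy_by_worker(rows, consistent_only=False):
    """Return {worker: fraction of rows where llm_label == human_label}.

    If consistent_only is True, rows whose judgment changed when the two
    outputs were swapped (swap_equal == False) are skipped.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        if consistent_only and not row["swap_equal"]:
            continue
        total[row["worker"]] += 1
        correct[row["worker"]] += int(row["llm_label"] == row["human_label"])
    return {worker: correct[worker] / total[worker] for worker in total}


# Toy rows with made-up worker names; only the columns used here are included.
rows = [
    {"worker": "w_A@Vanilla", "human_label": 0, "llm_label": 0, "swap_equal": True},
    {"worker": "w_A@Vanilla", "human_label": 0, "llm_label": 1, "swap_equal": False},
    {"worker": "w_A@Vanilla", "human_label": 0, "llm_label": 0, "swap_equal": True},
    {"worker": "w_A@Vanilla", "human_label": 0, "llm_label": 0, "swap_equal": True},
    {"worker": "w_B@CoT", "human_label": 0, "llm_label": 1, "swap_equal": True},
    {"worker": "w_B@CoT", "human_label": 0, "llm_label": 0, "swap_equal": True},
]

all_acc = accuracy_by_worker(rows)
consistent_acc = accuracy_by_worker(rows, consistent_only=True)
```

Because `human_label` is always `0`, accuracy here reduces to the share of rows with `llm_label == 0`; the comparison is kept explicit so the function stays valid if the labels were ever not constant.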


Provider:

bay-calibration-llm-evaluators
