
bay-calibration-llm-evaluators/llmbar-annotated-latest

Source: Hugging Face. Updated 2024-11-18; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/bay-calibration-llm-evaluators/llmbar-annotated-latest
Description:
---
dataset_info:
  features:
  - name: task
    dtype: string
  - name: worker
    dtype: string
  - name: human_label
    dtype: int64
  - name: llm_label
    dtype: int64
  - name: generator_1
    dtype: string
  - name: generator_2
    dtype: string
  - name: instruction
    dtype: string
  - name: output_1
    dtype: string
  - name: output_2
    dtype: string
  - name: sub_dataset
    dtype: string
  - name: swap_equal
    dtype: bool
  splits:
  - name: train
    num_bytes: 8262754
    num_examples: 6285
  download_size: 386972
  dataset_size: 8262754
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# LLMBar-Select Dataset

## Introduction

The **LLMBar-Select** dataset is a curated subset of the original **LLMBar** dataset introduced by [Zeng et al. (2024)](https://arxiv.org/abs/2310.07641). The LLMBar dataset consists of 419 instances, each containing an instruction paired with two outputs: one that faithfully follows the instruction and another that deviates while presenting superficially appealing qualities. It is designed to evaluate LLM-based evaluators more rigorously and objectively than previous benchmarks.

The original dataset has two primary subsets:

1. **Natural Set**: Instances derived from existing human-preference datasets, filtered and modified for objective preferences.
2. **Adversarial Set**: Instances with outputs crafted to mislead evaluators by emphasizing superficial attributes.

In the LLMBar dataset, instances are evaluated using five different large language models (GPT-4-0613, GPT-3.5-turbo-0613, GPT-3.5-turbo-0301, Llama-2-70B-Chat, and PaLM 2) under various prompting strategies (e.g., Vanilla, Chain of Thought (CoT), Swap+CoT, etc.). A unique combination of an LLM and a prompting strategy is referred to as an **evaluator mode**.

The **LLMBar-Select** dataset focuses on evaluator modes with high evaluation accuracy. Specifically:

- Only evaluator modes with an overall evaluation accuracy above **60%** are included.
- Selected modes are based on **GPT-4** and **PaLM 2**:
  - **GPT-4**: Eight evaluator modes with evaluation accuracy exceeding 70%.
  - **PaLM 2**: Seven evaluator modes.

With 419 comparison tasks in the original LLMBar dataset and 15 evaluator modes (8 from GPT-4 and 7 from PaLM 2), the LLMBar-Select dataset contains **6,285 rows** (419 × 15).

This dataset accompanies the paper [**Gao et al. (2024). _Bayesian Calibration of Win Rate Estimation with LLM Evaluators_**](https://arxiv.org/abs/2411.04424). If you use this dataset, please cite both the **Gao et al. (2024)** and **Zeng et al. (2024)** papers.

The original LLMBar dataset is available on [GitHub](https://github.com/princeton-nlp/LLMBar).

## Dataset Details

### Columns

- **task**: Unique identifier for each comparison task in the format `t_{task ID}`. Tasks with identical instructions and outputs share the same task ID (starting from 0).
- **worker**: Denotes the evaluator mode, formatted as `w_{model name}@{prompting strategy}`.
- **human_label**: Always `0`, indicating `output_1` is the objectively better response.
- **llm_label**:
  - `0`: LLM evaluator considers `output_1` better.
  - `1`: LLM evaluator considers `output_2` better.
- **generator_1**: Always `"correct"`, signifying `output_1` is the superior response.
- **generator_2**: Always `"incorrect"`, signifying `output_2` is the inferior response.
- **instruction**: Instruction or query intended to be answered by the outputs.
- **output_1**: First response to the instruction.
- **output_2**: Second response to the instruction.
- **sub_dataset**: Label indicating which subdataset the comparison belongs to. Refer to the LLMBar paper for details.
- **swap_equal**: Boolean field denoting whether swapping the order of the outputs results in the same judgment by the LLM evaluator. If `false`, the `llm_label` is randomly assigned.
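As a rough illustration of how these columns fit together, the sketch below computes, for each evaluator mode (`worker`), the fraction of rows where `llm_label` matches `human_label` and the fraction where `swap_equal` is true. The helper name `per_worker_stats`, the worker names, and the sample rows are hypothetical and hand-made for the example; they are not taken from the dataset.

```python
from collections import defaultdict


def per_worker_stats(rows):
    """Return {worker: (accuracy, swap_consistency)} for an iterable of row dicts.

    accuracy         = share of rows where llm_label equals human_label
    swap_consistency = share of rows where swap_equal is true
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [total, correct, swap-consistent]
    for row in rows:
        c = counts[row["worker"]]
        c[0] += 1
        c[1] += int(row["llm_label"] == row["human_label"])
        c[2] += int(bool(row["swap_equal"]))
    return {
        worker: (correct / total, consistent / total)
        for worker, (total, correct, consistent) in counts.items()
    }


# Hypothetical rows that only mimic the column layout (not real data).
rows = [
    {"task": "t_0", "worker": "w_example@vanilla", "human_label": 0, "llm_label": 0, "swap_equal": True},
    {"task": "t_1", "worker": "w_example@vanilla", "human_label": 0, "llm_label": 1, "swap_equal": False},
    {"task": "t_0", "worker": "w_example@cot", "human_label": 0, "llm_label": 0, "swap_equal": True},
]

stats = per_worker_stats(rows)
for worker, (accuracy, consistency) in sorted(stats.items()):
    print(f"{worker}: accuracy={accuracy:.2f}, swap-consistent={consistency:.2f}")
```

To run this on the actual data, the rows could presumably be obtained with `datasets.load_dataset("bay-calibration-llm-evaluators/llmbar-annotated-latest", split="train")`, assuming the Hugging Face `datasets` library is installed. Because `llm_label` is randomly assigned when `swap_equal` is `false`, accuracy figures computed this way include some noise from those rows.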
## Use Cases

This dataset is useful for:

- **LLM Evaluation Analysis**: Measuring the ability of LLMs under different prompting strategies to evaluate instruction-following tasks.
- **LLM Bias Analysis**: Investigating model biases and evaluator tendencies.

## Citation

If you use the LLMBar-Select dataset, please cite:

- **Gao et al. (2024)**: [*Bayesian Calibration of Win Rate Estimation with LLM Evaluators*](https://arxiv.org/abs/2411.04424).
- **Zeng et al. (2024)**: [*Evaluating Large Language Models at Evaluating Instruction Following*](https://arxiv.org/abs/2310.07641).

Provider:
bay-calibration-llm-evaluators