bay-calibration-llm-evaluators/llmbar-annotated-latest
Source: Hugging Face | Updated: 2024-11-18 | Indexed: 2025-04-26
Download link:
https://hf-mirror.com/datasets/bay-calibration-llm-evaluators/llmbar-annotated-latest
Resource description:
---
dataset_info:
  features:
  - name: task
    dtype: string
  - name: worker
    dtype: string
  - name: human_label
    dtype: int64
  - name: llm_label
    dtype: int64
  - name: generator_1
    dtype: string
  - name: generator_2
    dtype: string
  - name: instruction
    dtype: string
  - name: output_1
    dtype: string
  - name: output_2
    dtype: string
  - name: sub_dataset
    dtype: string
  - name: swap_equal
    dtype: bool
  splits:
  - name: train
    num_bytes: 8262754
    num_examples: 6285
  download_size: 386972
  dataset_size: 8262754
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# LLMBar-Select Dataset
## Introduction
The **LLMBar-Select** dataset is a curated subset of the original **LLMBar** dataset introduced by [Zeng et al. (2024)](https://arxiv.org/abs/2310.07641). The LLMBar dataset consists of 419 instances, each containing an instruction paired with two outputs: one that faithfully follows the instruction and another that deviates while presenting superficially appealing qualities. It is designed to evaluate LLM-based evaluators more rigorously and objectively than previous benchmarks.
The original dataset has two primary subsets:
1. **Natural Set**: Instances derived from existing human-preference datasets, filtered and modified for objective preferences.
2. **Adversarial Set**: Instances with outputs crafted to mislead evaluators by emphasizing superficial attributes.
In the LLMBar dataset, instances are evaluated using five different large language models (GPT-4-0613, GPT-3.5-turbo-0613, GPT-3.5-turbo-0301, Llama-2-70B-Chat, and PaLM 2) under various prompting strategies (e.g., Vanilla, Chain of Thought (CoT), Swap+CoT, etc.). A unique combination of an LLM and a prompting strategy is referred to as an **evaluator mode**.
The **LLMBar-Select** dataset focuses on evaluator modes with high evaluation accuracy. Specifically:
- Only evaluator modes with an overall evaluation accuracy above **60%** are included.
- Selected modes are based on **GPT-4** and **PaLM 2**:
  - **GPT-4**: Eight evaluator modes with evaluation accuracy exceeding 70%.
  - **PaLM 2**: Seven evaluator modes.
With 419 comparison tasks in the original LLMBar dataset and 15 evaluator modes (8 from GPT-4 and 7 from PaLM 2), the LLMBar-Select dataset contains **6,285 rows** (419 × 15).
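The row count can be sanity-checked by counting rows per evaluator mode. The sketch below is only an illustration, not part of the dataset's tooling: `rows_per_worker` and `has_expected_shape` are hypothetical helpers, and the toy rows (including the worker names) stand in for the real split.

```python
from collections import Counter

N_TASKS = 419    # comparison tasks in the original LLMBar dataset
N_WORKERS = 15   # evaluator modes kept in LLMBar-Select (8 GPT-4 + 7 PaLM 2)


def rows_per_worker(rows):
    """Count how many rows each evaluator mode (`worker`) contributes."""
    return Counter(row["worker"] for row in rows)


def has_expected_shape(rows, n_tasks=N_TASKS, n_workers=N_WORKERS):
    """True if there are n_workers modes and each one contributes n_tasks rows."""
    counts = rows_per_worker(rows)
    return len(counts) == n_workers and all(c == n_tasks for c in counts.values())


# Toy stand-in for the real split: 3 tasks x 2 made-up worker names.
toy_rows = [
    {"task": f"t_{t}", "worker": w}
    for w in ("w_modelA@Vanilla", "w_modelB@CoT")
    for t in range(3)
]

print(N_TASKS * N_WORKERS)                                   # 6285
print(has_expected_shape(toy_rows, n_tasks=3, n_workers=2))  # True
```

On the real data, the same check would be run with the default arguments on the rows of the `train` split, for example after loading it with the Hugging Face `datasets` library.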
This dataset accompanies the paper [**Gao et al. (2024). _Bayesian Calibration of Win Rate Estimation with LLM Evaluators_**](https://arxiv.org/abs/2411.04424). If you use this dataset, please cite both the **Gao et al. (2024)** and **Zeng et al. (2024)** papers. The original LLMBar dataset is available on [GitHub](https://github.com/princeton-nlp/LLMBar).
## Dataset Details
### Columns
- **task**: Unique identifier for each comparison task in the format `t_{task ID}`. Tasks with identical instructions and outputs share the same task ID (starting from 0).
- **worker**: Denotes the evaluator mode, formatted as `w_{model name}@{prompting strategy}`.
- **human_label**: Always `0`, indicating `output_1` is the objectively better response.
- **llm_label**:
  - `0`: LLM evaluator considers `output_1` better.
  - `1`: LLM evaluator considers `output_2` better.
- **generator_1**: Always `"correct"`, signifying `output_1` is the superior response.
- **generator_2**: Always `"incorrect"`, signifying `output_2` is the inferior response.
- **instruction**: Instruction or query intended to be answered by the outputs.
- **output_1**: First response to the instruction.
- **output_2**: Second response to the instruction.
- **sub_dataset**: Label indicating which subdataset the comparison belongs to. Refer to the LLMBar paper for details.
- **swap_equal**: Boolean field denoting whether swapping the order of the outputs results in the same judgment by the LLM evaluator. If `false`, the `llm_label` is randomly assigned.
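The column conventions above can be turned into a quick per-row check. This is a sketch under assumptions: `parse_worker` and `check_row` are hypothetical helper names, the sample row is made up for illustration, and splitting the worker string at the first `@` assumes model names contain no `@`.

```python
def parse_worker(worker):
    """Split `w_{model name}@{prompting strategy}` into (model, strategy)."""
    if not worker.startswith("w_") or "@" not in worker:
        raise ValueError(f"unexpected worker format: {worker!r}")
    # Assumption: the first "@" separates the model name from the strategy.
    model, _, strategy = worker[2:].partition("@")
    return model, strategy


def check_row(row):
    """Return the list of violated column conventions (empty if none)."""
    problems = []
    if not row["task"].startswith("t_"):
        problems.append("task should look like t_{task ID}")
    if row["human_label"] != 0:
        problems.append("human_label should always be 0")
    if row["llm_label"] not in (0, 1):
        problems.append("llm_label should be 0 or 1")
    if row["generator_1"] != "correct" or row["generator_2"] != "incorrect":
        problems.append("generator_1/generator_2 should be correct/incorrect")
    if not isinstance(row["swap_equal"], bool):
        problems.append("swap_equal should be a boolean")
    return problems


# Made-up row that follows the documented conventions.
sample = {
    "task": "t_0",
    "worker": "w_modelA@Vanilla",
    "human_label": 0,
    "llm_label": 1,
    "generator_1": "correct",
    "generator_2": "incorrect",
    "instruction": "Example instruction",
    "output_1": "Example output that follows the instruction",
    "output_2": "Example output that deviates from it",
    "sub_dataset": "Natural",
    "swap_equal": True,
}

print(parse_worker(sample["worker"]))  # ('modelA', 'Vanilla')
print(check_row(sample))               # []
```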
## Use Cases
This dataset is useful for:
- **LLM Evaluation Analysis**: Measuring the ability of LLMs under different prompting strategies to evaluate instruction-following tasks.
- **LLM Bias Analysis**: Investigating evaluator biases, such as sensitivity to the order in which the two outputs are presented (see `swap_equal`).
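Both use cases reduce to simple aggregates over the columns. The sketch below computes, per evaluator mode, the accuracy against `human_label` and the share of rows whose judgment survived swapping the outputs (`swap_equal`). `summarize` is a hypothetical helper and the rows are toy data; reporting accuracy separately on swap-consistent rows is one possible choice, not something the dataset card prescribes.

```python
from collections import defaultdict


def summarize(rows):
    """Per-worker accuracy, swap-consistency rate, and accuracy on consistent rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["worker"]].append(row)

    report = {}
    for worker, items in groups.items():
        n_correct = sum(r["llm_label"] == r["human_label"] for r in items)
        consistent = [r for r in items if r["swap_equal"]]
        n_correct_consistent = sum(
            r["llm_label"] == r["human_label"] for r in consistent
        )
        report[worker] = {
            "accuracy": n_correct / len(items),
            "swap_consistent_rate": len(consistent) / len(items),
            # None when no row for this worker is swap-consistent.
            "accuracy_when_consistent": (
                n_correct_consistent / len(consistent) if consistent else None
            ),
        }
    return report


def make_row(worker, llm_label, swap_equal):
    """Build a toy row; human_label is always 0 as in the dataset."""
    return {
        "worker": worker,
        "human_label": 0,
        "llm_label": llm_label,
        "swap_equal": swap_equal,
    }


toy_rows = [
    make_row("w_modelA@Vanilla", 0, True),
    make_row("w_modelA@Vanilla", 0, True),
    make_row("w_modelA@Vanilla", 1, True),
    make_row("w_modelA@Vanilla", 1, False),
    make_row("w_modelB@CoT", 0, True),
    make_row("w_modelB@CoT", 0, False),
]

report = summarize(toy_rows)
for worker, stats in sorted(report.items()):
    print(worker, stats)
```

Because `llm_label` is randomly assigned when `swap_equal` is `false`, comparing `accuracy` with `accuracy_when_consistent` gives a rough sense of how much order sensitivity affects an evaluator mode.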
## Citation
If you use the LLMBar-Select dataset, please cite:
- **Gao et al. (2024)**: [*Bayesian Calibration of Win Rate Estimation with LLM Evaluators*](https://arxiv.org/abs/2411.04424).
- **Zeng et al. (2024)**: [*Evaluating Large Language Models at Evaluating Instruction Following*](https://arxiv.org/abs/2310.07641).
Provided by:
bay-calibration-llm-evaluators