bay-calibration-llm-evaluators/hanna-annotated-latest
Source: Hugging Face. Updated 2024-11-14; indexed 2025-04-26.
Download link:
https://hf-mirror.com/datasets/bay-calibration-llm-evaluators/hanna-annotated-latest
Description:
---
dataset_info:
  features:
  - name: task
    dtype: string
  - name: worker
    dtype: string
  - name: human_label
    dtype: int64
  - name: llm_label
    dtype: int64
  - name: generator_1
    dtype: string
  - name: generator_2
    dtype: string
  - name: premise
    dtype: string
  - name: __index_level_0__
    dtype: float64
  splits:
  - name: train
    num_bytes: 7006770
    num_examples: 31680
  download_size: 307756
  dataset_size: 7006770
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# HANNA-LLMEval Dataset
## Overview
The original **HANNA** dataset (Chhun et al., 2022) contains 1,056 stories, each annotated by human raters using a 5-point Likert scale across six criteria: **Relevance**, **Coherence**, **Empathy**, **Surprise**, **Engagement**, and **Complexity**. These stories are based on 96 story prompts from the **WritingPrompts** dataset (Fan et al., 2018). Each prompt has 11 stories: one human-written and 10 generated by different automatic text generation models (96 × 11 = 1,056).
This **HANNA-LLMEval** dataset builds upon this framework by adding LLM evaluations on pairs of stories generated by different text generators (including human) for the same prompt. This dataset accompanies the paper [**Gao et al. (2024). _Bayesian Calibration of Win Rate Estimation with LLM Evaluators_**](https://arxiv.org/abs/2411.04424). Please cite this paper if you use this dataset in your work.
For more details on the original HANNA dataset, please refer to the [HANNA paper](https://arxiv.org/abs/2208.11646).
## Source and Licensing
The original **HANNA** dataset is available on [GitHub](https://github.com/dig-team/hanna-benchmark-asg). Please consult the HANNA dataset's publication and licensing terms before using this dataset.
## Dataset Columns
- **task**: A unique identifier for each comparison task. Each task corresponds to a unique combination of premise, generator_1, and generator_2. Task labels are in the format "t_{task ID}". Tasks with the same premise, generator_1, and generator_2 will share the same task ID. Task ID starts from 0.
- **worker**: Identifies the evaluator mode used to assess the comparison task. The format is "w_{model name}-{prompting strategy}".
- **human_label**:
  - `0`: Generator_1 is considered to produce a better story than Generator_2 by human evaluators.
  - `1`: Generator_2 is considered to produce a better story than Generator_1 by human evaluators.

  The label is determined by summing the scores from all human evaluators involved.
- **llm_label**:
  - `0`: Generator_1 is considered to produce a better story than Generator_2 by the LLM evaluator (worker).
  - `1`: Generator_2 is considered to produce a better story than Generator_1 by the LLM evaluator (worker).
- **generator_1**: The first text generator for comparison.
- **generator_2**: The second text generator for comparison.
- **premise**: The writing prompt from which the text generators are asked to generate the stories.
- **__index_level_0__**: A column that is not useful and should be disregarded.
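To illustrate how the columns above fit together, here is a minimal sketch that measures how often each evaluator mode (`worker`) agrees with the human label. It assumes the rows are available as plain dictionaries keyed by column name; the function name `agreement_by_worker`, the worker strings, and the toy values are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def agreement_by_worker(rows):
    """Return {worker: fraction of rows where llm_label == human_label}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row["worker"]] += 1
        hits[row["worker"]] += int(row["llm_label"] == row["human_label"])
    return {worker: hits[worker] / totals[worker] for worker in totals}


# Toy rows with made-up values; the worker names are placeholders only.
toy_rows = [
    {"task": "t_0", "worker": "w_modelA-strategyX", "human_label": 0, "llm_label": 0},
    {"task": "t_1", "worker": "w_modelA-strategyX", "human_label": 1, "llm_label": 0},
    {"task": "t_0", "worker": "w_modelB-strategyY", "human_label": 0, "llm_label": 0},
    {"task": "t_1", "worker": "w_modelB-strategyY", "human_label": 1, "llm_label": 1},
]

agreement = agreement_by_worker(toy_rows)
print(agreement)
```

Because every task appears once per evaluator mode, grouping by `worker` compares each mode against the same set of human labels.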
## Dataset Extensions
The original **HANNA** dataset includes 11 text generators (including human) across 96 story prompts, resulting in 55 distinct generator pairs for comparison. This leads to a total of 96 * 55 = 5,280 unique comparison tasks.
In the **HANNA-LLMEval** dataset, we extend this by including evaluations from two LLMs: **GPT-3.5-turbo 0125** and **Gemini-1.0-Pro**, each using three distinct prompting strategies (Score-only, Rate-explain, Analyze-rate). Therefore, there are 6 evaluator modes in total, resulting in 5,280 * 6 = 31,680 rows in the dataset.
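The counts in this section can be checked with a few lines of arithmetic. The sketch below uses placeholder generator names (`gen_0` to `gen_10`), which are an assumption for illustration; only the counts come from the description above.

```python
from itertools import combinations, product

NUM_PROMPTS = 96
NUM_GENERATORS = 11  # 10 automatic models plus the human writer

# Placeholder names; the real generator names are not needed for counting.
generators = [f"gen_{i}" for i in range(NUM_GENERATORS)]

# Unordered pairs of distinct generators: C(11, 2) = 55.
pairs = list(combinations(generators, 2))
num_tasks = NUM_PROMPTS * len(pairs)

# Two LLMs crossed with three prompting strategies: 6 evaluator modes.
llms = ["GPT-3.5-turbo 0125", "Gemini-1.0-Pro"]
strategies = ["Score-only", "Rate-explain", "Analyze-rate"]
modes = list(product(llms, strategies))
num_rows = num_tasks * len(modes)

print(len(pairs), num_tasks, len(modes), num_rows)  # 55 5280 6 31680
```

The final figure matches the `num_examples: 31680` value reported in the dataset metadata.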
Each comparison task is evaluated twice per evaluator mode, with the order of the stories switched in each trial. The scores from both evaluations are then summed across the six evaluation criteria (coherence, empathy, etc.) to determine the final score for each story. The story with the higher final score is deemed the "winner" of the comparison. If the two stories happen to have the same final score, the winner is picked randomly.
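The labelling rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names, the per-trial dictionary layout, and the toy scores are hypothetical, and the label convention (`0` when the first story wins) follows the column description.

```python
import random

CRITERIA = ["relevance", "coherence", "empathy",
            "surprise", "engagement", "complexity"]


def final_score(trials):
    """Sum one story's scores over both trials and all six criteria."""
    return sum(trial[criterion] for trial in trials for criterion in CRITERIA)


def pick_label(story1_trials, story2_trials, rng=random):
    """Return 0 if story 1 wins, 1 if story 2 wins; ties are broken at random."""
    score1 = final_score(story1_trials)
    score2 = final_score(story2_trials)
    if score1 == score2:
        return rng.choice([0, 1])
    return 0 if score1 > score2 else 1


# Toy scores: one dict per trial (original order, then swapped order).
story1 = [dict.fromkeys(CRITERIA, 4), dict.fromkeys(CRITERIA, 3)]  # total 42
story2 = [dict.fromkeys(CRITERIA, 3), dict.fromkeys(CRITERIA, 3)]  # total 36

label = pick_label(story1, story2)
tie_label = pick_label(story2, story2, rng=random.Random(0))
print(label, tie_label in (0, 1))
```

Passing a seeded `random.Random` instance as `rng` makes the tie-break reproducible, which is convenient when re-deriving labels.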
## Usage
You can access and use this dataset for tasks such as:
- Evaluating the performance of different text generation models.
- Investigating LLM-based story evaluation and ranking.
- Exploring model biases and tendencies across various evaluation criteria.
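As one example of the first use case, the sketch below computes the observed win rate of one generator against another, under either the human labels or the LLM labels. It is a plain frequency count, not the Bayesian calibration method of the accompanying paper. The function name `win_rate`, the generator names, the worker string, and the toy rows are hypothetical.

```python
def win_rate(rows, gen_a, gen_b, label_key):
    """Fraction of gen_a-vs-gen_b comparisons that gen_a wins under label_key."""
    wins = 0
    total = 0
    for row in rows:
        if row["generator_1"] == gen_a and row["generator_2"] == gen_b:
            wins += int(row[label_key] == 0)  # label 0: generator_1 wins
        elif row["generator_1"] == gen_b and row["generator_2"] == gen_a:
            wins += int(row[label_key] == 1)  # label 1: generator_2 wins
        else:
            continue
        total += 1
    return wins / total if total else float("nan")


# Toy rows for a single evaluator mode; all values are made up.
toy_rows = [
    {"worker": "w_x", "generator_1": "Human", "generator_2": "ModelA",
     "human_label": 0, "llm_label": 1},
    {"worker": "w_x", "generator_1": "Human", "generator_2": "ModelA",
     "human_label": 0, "llm_label": 1},
    {"worker": "w_x", "generator_1": "ModelA", "generator_2": "Human",
     "human_label": 1, "llm_label": 1},
    {"worker": "w_x", "generator_1": "ModelA", "generator_2": "Human",
     "human_label": 0, "llm_label": 1},
]

human_rate = win_rate(toy_rows, "Human", "ModelA", "human_label")
llm_rate = win_rate(toy_rows, "Human", "ModelA", "llm_label")
print(human_rate, llm_rate)
```

When computing the LLM win rate on the real data, it is probably sensible to filter the rows to a single `worker` first, since each task appears once per evaluator mode.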
## Citation
- Gao et al. (2024). [*Bayesian Calibration of Win Rate Estimation with LLM Evaluators*.](https://arxiv.org/abs/2411.04424)
- Chhun et al. (2022). [*Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation*.](https://arxiv.org/abs/2208.11646)
- Fan et al. (2018). [*Hierarchical Neural Story Generation*.](https://arxiv.org/abs/1805.04833)
Provider:
bay-calibration-llm-evaluators



