huuuyeah/DecipherPref

Name: huuuyeah/DecipherPref
Creator: huuuyeah
Published: 2024-05-23 19:25:26
License: No description provided

Source: Hugging Face. Updated 2024-05-23; indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/huuuyeah/DecipherPref

Resource description:

---
license: apache-2.0
task_categories:
- summarization
- text-classification
language:
- en
tags:
- Preference
- Annotated Data
- Alignment
size_categories:
- 10M<n<100M
---

## Overview

Human preference judgments are pivotal in guiding large language models (LLMs) to produce outputs that align with human values. Human evaluations are also used in summarization tasks to compare outputs from various systems, complementing existing automatic metrics. Despite their significance, however, there has been limited research probing these pairwise or k-wise comparisons. The collective impact and relative importance of factors such as output length, informativeness, fluency, and factual consistency are still not well understood. It is also unclear whether other hidden factors influence human judgments. In this paper, we conduct an in-depth examination of a collection of pairwise human judgments released by OpenAI. Utilizing the Bradley-Terry-Luce (BTL) model, we reveal the inherent preferences embedded in these human judgments.

## Data Structure

```json
{
  "doc_id": <str>,
  "title": <str>,
  "article": <str>,                # source document
  "winner_sum": {
    "text": <str>,
    "policy": <str>,
    "annotation": <dict>,          # GPT-4 annotation on the proposed criteria
    "preference_factors": <list>   # list of final preference factors of each summary
  },
  "defeated_sum": {
    "text": <str>,
    "policy": <str>,
    "annotation": <dict>,
    "preference_factors": <list>
  }
}
```

## Usage

#### Load from Huggingface (UNAVAILABLE)

```python
from datasets import load_dataset

dataset = load_dataset("huuuyeah/DecipherPref")
preference_data = dataset['train']
print(preference_data[0])
```

#### Load from local

Download *train.json* to a local folder.

```python
import json

# Path to the downloaded file; adjust to your local folder.
PATH_JSON_DATA = "train.json"

data = []
# The file holds one JSON object per line.
with open(PATH_JSON_DATA, 'r') as f:
    for line in f:
        data.append(json.loads(line.strip()))

print(data[0])
```

## Acknowledgement

Please cite the following paper in work that makes use of this dataset:

[DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4](https://aclanthology.org/2023.emnlp-main.519/)\
Yebowen Hu, Kaiqiang Song, Sangwoo Cho, Xiaoyang Wang, Hassan Foroosh, Fei Liu\
In the main conference of Empirical Methods in Natural Language Processing (EMNLP'23), Singapore.

## Bibtex

```
@inproceedings{hu-etal-2023-decipherpref,
    title = "{D}ecipher{P}ref: Analyzing Influential Factors in Human Preference Judgments via {GPT}-4",
    author = "Hu, Yebowen and Song, Kaiqiang and Cho, Sangwoo and Wang, Xiaoyang and Foroosh, Hassan and Liu, Fei",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.519",
    doi = "10.18653/v1/2023.emnlp-main.519",
    pages = "8344--8357",
}
```

Provider:

huuuyeah

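Summary of original information

Dataset overview

Human preference judgments play a crucial role in guiding large language models (LLMs) to generate outputs that align with human values. In summarization tasks, human evaluations are also used to compare outputs from different systems, complementing existing automatic metrics. Despite their importance, research on these pairwise or k-wise comparisons remains limited. The collective impact and relative importance of factors such as output length, informativeness, fluency, and factual consistency are still not well understood. It is also unclear whether other hidden factors influence human judgments. In this paper, we conduct an in-depth study of a collection of pairwise human judgments released by OpenAI. Using the Bradley-Terry-Luce (BTL) model, we reveal the inherent preferences embedded in these human judgments.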
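As a rough illustration of the BTL model mentioned above, the sketch below fits strength parameters to a handful of pairwise outcomes using the standard minorization-maximization (MM) update. It is not the authors' code: the function name `fit_btl`, the factor names, and the win counts are all hypothetical, and the update assumes every item both wins and loses at least once and never faces itself.

```python
from collections import defaultdict


def fit_btl(comparisons, n_iter=200, tol=1e-9):
    """Estimate BTL strengths from (winner, loser) pairs with the MM update."""
    wins = defaultdict(int)         # total wins per item
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        # Normalize so the strengths sum to 1 (BTL is scale-invariant).
        total = sum(updated.values())
        updated = {k: v / total for k, v in updated.items()}
        converged = max(abs(updated[k] - strength[k]) for k in items) < tol
        strength = updated
        if converged:
            break
    return strength


# Hypothetical toy outcomes: each tuple is (winner, loser).
comparisons = (
    [("informative", "fluent")] * 7 + [("fluent", "informative")] * 3
    + [("fluent", "long")] * 6 + [("long", "fluent")] * 4
    + [("informative", "long")] * 8 + [("long", "informative")] * 2
)

strength = fit_btl(comparisons)

# Under BTL, P(a beats b) = strength[a] / (strength[a] + strength[b]).
p_inf_over_long = strength["informative"] / (
    strength["informative"] + strength["long"]
)
print(sorted(strength.items(), key=lambda kv: -kv[1]))
print(p_inf_over_long)
```

A larger fitted strength means the item is more likely to be preferred in a head-to-head comparison; only the ratios between strengths are meaningful.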

Data structure

    {
      "doc_id": <str>,
      "title": <str>,
      "article": <str>,                # source document
      "winner_sum": {
        "text": <str>,
        "policy": <str>,
        "annotation": <dict>,          # GPT-4 annotation on the proposed criteria
        "preference_factors": <list>   # list of final preference factors of each summary
      },
      "defeated_sum": {
        "text": <str>,
        "policy": <str>,
        "annotation": <dict>,
        "preference_factors": <list>
      }
    }
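To make the record layout concrete, here is a minimal sketch that tallies how often each preference factor appears on winning versus defeated summaries. The two sample records, the policy names, and the factor names are invented for illustration, and the code assumes `preference_factors` is a list of strings, which the schema above does not confirm.

```python
from collections import Counter

# Hypothetical records that follow the schema; all values are made up.
records = [
    {
        "doc_id": "d1", "title": "t1", "article": "source text one",
        "winner_sum": {"text": "summary a", "policy": "policy_x",
                       "annotation": {}, "preference_factors": ["factor_a", "factor_b"]},
        "defeated_sum": {"text": "summary b", "policy": "policy_y",
                         "annotation": {}, "preference_factors": ["factor_c"]},
    },
    {
        "doc_id": "d2", "title": "t2", "article": "source text two",
        "winner_sum": {"text": "summary c", "policy": "policy_y",
                       "annotation": {}, "preference_factors": ["factor_a"]},
        "defeated_sum": {"text": "summary d", "policy": "policy_x",
                         "annotation": {}, "preference_factors": ["factor_b", "factor_c"]},
    },
]


def tally_factors(records):
    """Count factor occurrences separately for winning and defeated summaries."""
    winner_counts, defeated_counts = Counter(), Counter()
    for rec in records:
        winner_counts.update(rec["winner_sum"]["preference_factors"])
        defeated_counts.update(rec["defeated_sum"]["preference_factors"])
    return winner_counts, defeated_counts


winner_counts, defeated_counts = tally_factors(records)
print(winner_counts.most_common())
print(defeated_counts.most_common())
```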

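Usage

Load from local

Download train.json to a local folder.

    import json

    # Path to the downloaded file; adjust to your local folder.
    PATH_JSON_DATA = "train.json"

    data = []
    # The file holds one JSON object per line.
    with open(PATH_JSON_DATA, 'r') as f:
        for line in f:
            data.append(json.loads(line.strip()))

    print(data[0])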
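Once records are loaded as a list of dicts, a quick sanity check on one of the factors named in the overview, output length, could look like the sketch below. It uses only the `text` fields from the schema; the helper name `winner_longer_rate`, the whitespace tokenization, and the sample records are assumptions for illustration, not part of the dataset's tooling.

```python
def winner_longer_rate(records):
    """Fraction of pairs whose winning summary has more whitespace-separated tokens."""
    if not records:
        return 0.0
    longer = 0
    for rec in records:
        winner_len = len(rec["winner_sum"]["text"].split())
        defeated_len = len(rec["defeated_sum"]["text"].split())
        if winner_len > defeated_len:
            longer += 1
    return longer / len(records)


# Hypothetical stand-in for the loaded data; only the fields used here are filled in.
sample = [
    {"winner_sum": {"text": "a b c d"}, "defeated_sum": {"text": "a b"}},
    {"winner_sum": {"text": "a b c"}, "defeated_sum": {"text": "a"}},
    {"winner_sum": {"text": "a"}, "defeated_sum": {"text": "a b c"}},
]

rate = winner_longer_rate(sample)
print(rate)
```

A rate well above 0.5 on the real data would suggest that annotators tend to prefer longer summaries, though length alone says nothing about why.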
