GIEBench
Source: ModelScope community. Updated: 2025-11-25. Indexed: 2025-06-14.
Download link:
https://modelscope.cn/datasets/m-a-p/GIEBench
Resource description:
## Introduction
**GIE-Bench**
We introduce GIEBench, a comprehensive benchmark that includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities. GIEBench is designed to evaluate the empathy of LLMs when presented with specific group identities such as gender, age, occupation, and race, emphasizing their ability to respond from the standpoint of the identified group. The detailed statistical information can be found in the image below.
<div style="text-align: center;">
<img src="img/item.png" width="80%">
</div>
Initially, a collection of controversial topics is developed using web resources, manual selection, and GPT-4, with each topic corresponding to a specific identity. Subsequently, we annotate attitude labels from the perspectives of these identities. We also use GPT-4 to generate four responses for each topic, ensuring that only one response aligns with the identity's stance. Finally, using the established identities, topics, and responses, we design three types of prompts that ask LLMs to select the most appropriate response. In the CoT-Prompt, a Chain of Thought (CoT) is provided along with identity information. In the ID-Prompt, only the identity is disclosed, while the Raw-Prompt includes no additional information. The detailed process can be found in the image below.
<div style="text-align: center;">
<img src="img/pipline.png" width="80%">
</div>
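The three prompt types described above could be assembled roughly as follows. This is a minimal sketch, not the benchmark's actual templates: the `Item` fields, the `build_prompt` helper, the mode names, and all prompt wording are assumptions made for illustration, and the sample item is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    """One single-choice question (field names are hypothetical)."""
    identity: str        # group identity, e.g. an occupation or age group
    topic: str           # controversial topic tied to that identity
    options: List[str]   # four candidate responses
    answer: int          # index of the response matching the identity's stance


def format_options(options: List[str]) -> str:
    """Label the candidate responses A, B, C, D."""
    return "\n".join(f"{'ABCD'[i]}. {opt}" for i, opt in enumerate(options))


def build_prompt(item: Item, mode: str) -> str:
    """Build a Raw-, ID-, or CoT-style prompt (wording is illustrative only)."""
    question = (
        f"Topic: {item.topic}\n"
        "Choose the most appropriate response.\n"
        f"{format_options(item.options)}"
    )
    if mode == "raw":
        # Raw-Prompt: no additional information.
        return question
    identity_line = f"The user belongs to this group: {item.identity}."
    if mode == "id":
        # ID-Prompt: only the identity is disclosed.
        return f"{identity_line}\n{question}"
    if mode == "cot":
        # CoT-Prompt: identity plus a reasoning instruction (assumed phrasing).
        reasoning = (
            "First consider how this group is likely to view the topic, "
            "then pick the response that matches that standpoint."
        )
        return f"{identity_line}\n{reasoning}\n{question}"
    raise ValueError(f"unknown mode: {mode}")


# Invented sample item, used only to exercise the helper.
sample = Item(
    identity="teacher",
    topic="Should homework be abolished?",
    options=["Yes, entirely.", "No, it reinforces learning.",
             "Only on weekends.", "It does not matter."],
    answer=1,
)
```

Calling `build_prompt(sample, "raw")`, `build_prompt(sample, "id")`, and `build_prompt(sample, "cot")` yields three prompts that differ only in how much identity context precedes the same question.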
We analyze the extent to which LLMs understand the standpoint of the given identity by comparing the difference in accuracy between CoT-Prompt and Raw-Prompt.
<div style="text-align: center;">
<img src="img/Raw2COT.png" width="50%">
</div>
We analyze the empathy of LLMs towards the given identity standpoint by comparing the difference in accuracy between ID-Prompt and Raw-Prompt.
<div style="text-align: center;">
<img src="img/Raw2ID.png" width="50%">
</div>
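The two comparisons above reduce to accuracy differences against the Raw-Prompt baseline. The sketch below shows one way to compute them; the function names, the mode keys (`"raw"`, `"id"`, `"cot"`), and the toy predictions are assumptions for illustration and are not results from the benchmark.

```python
from typing import Dict, List


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of questions where the chosen option equals the labeled one."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("predictions and answers must be non-empty and aligned")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def prompt_gaps(preds_by_mode: Dict[str, List[str]],
                answers: List[str]) -> Dict[str, float]:
    """Accuracy gains of the CoT- and ID-Prompts over the Raw-Prompt."""
    acc = {mode: accuracy(preds, answers) for mode, preds in preds_by_mode.items()}
    return {
        # CoT-Prompt vs Raw-Prompt: how well the identity's standpoint is understood.
        "understanding_gap": acc["cot"] - acc["raw"],
        # ID-Prompt vs Raw-Prompt: empathy shown when only the identity is given.
        "empathy_gap": acc["id"] - acc["raw"],
    }


# Toy data, invented purely to exercise the functions.
answers = ["A", "B", "C", "D"]
toy_predictions = {
    "raw": ["A", "A", "A", "A"],
    "id":  ["A", "B", "A", "A"],
    "cot": ["A", "B", "C", "D"],
}
gaps = prompt_gaps(toy_predictions, answers)
```

A large `understanding_gap` paired with a small `empathy_gap` would correspond to the pattern the benchmark reports: the model can identify the group's standpoint when prompted to reason about it, but does not adopt it from identity information alone.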
The results reveal that although certain LLMs can largely understand the user's identity standpoint, they do not spontaneously exhibit empathy unless explicitly instructed to consider the user's perspective. This highlights the shortcomings of current alignment techniques.
Provider:
maas
Created:
2024-09-30



