
Evaluating LLM Alignment With Human Trust Models

Source: DataCite Commons; updated 2025-11-17; indexed 2026-04-25
Download link:
https://figshare.com/articles/dataset/Evaluating_LLM_Alignment_With_Human_Trust_Models/30642632
Description:
Trust plays a pivotal role in enabling effective cooperation, reducing uncertainty, and guiding decision-making in both human interactions and multi-agent systems. Despite its importance, there is limited understanding of how large language models (LLMs) internally conceptualize and reason about trust. This work presents a white-box analysis of trust representation in EleutherAI/gpt-j-6B, using contrastive prompting to generate embedding vectors within the LLM’s activation space for dyadic trust and related interpersonal relationship attributes. We first identified trust-related concepts from five established trust models. We then determined a threshold for significant conceptual alignment by computing pairwise cosine similarities across 60 general emotional concepts. Finally, we measured the cosine similarities between the LLM’s internal representation of trust and the derived trust-related concepts. Our results show that EleutherAI/gpt-j-6B effectively separates opposing concepts while grouping related ones, and that its internal trust representations align most closely with the Castelfranchi socio-cognitive model, followed by the McAllister model. These findings indicate that LLMs encode socio-cognitive constructs in their activation space in ways that support meaningful comparative analysis, enabling quantitative analysis of trust dynamics and supporting future research in multi-agent systems and trustworthy AI.
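
The comparison pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it uses random vectors in place of real gpt-j-6B activations, assumes a concept vector is the difference between mean activations on contrasting prompt sets, and assumes the alignment threshold is the mean plus two standard deviations of the baseline pairwise similarities. The description does not specify either choice, and all function names here are hypothetical.

```python
import numpy as np


def concept_vector(pos_acts, neg_acts):
    """Assumed contrastive direction: mean activation on prompts expressing
    the concept minus mean activation on prompts expressing its opposite."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pairwise_cosine(vectors):
    """Cosine similarity for every unordered pair of rows (i < j)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    i, j = np.triu_indices(len(vectors), k=1)
    return sims[i, j]


def alignment_threshold(baseline_vectors, n_std=2.0):
    """Assumed rule: mean + n_std * std of the baseline pairwise similarities."""
    sims = pairwise_cosine(baseline_vectors)
    return float(sims.mean() + n_std * sims.std())


rng = np.random.default_rng(0)
hidden = 64  # toy dimension for illustration, not the model's real hidden size

# Stand-in for the 60 general emotional concept vectors.
baseline = rng.normal(size=(60, hidden))
threshold = alignment_threshold(baseline)

# Stand-in activations for "trust" prompts versus contrasting prompts.
trust = concept_vector(
    rng.normal(size=(8, hidden)) + 1.0,
    rng.normal(size=(8, hidden)) - 1.0,
)

# A hypothetical trust-related concept, built here as a noisy copy of `trust`.
related = trust + 0.5 * rng.normal(size=hidden)

similarity = cosine_similarity(trust, related)
print(f"threshold={threshold:.3f} similarity={similarity:.3f} "
      f"aligned={similarity > threshold}")
```

With real data, the random arrays would be replaced by hidden-state activations extracted from the model for each prompt set, and each concept from the five trust models would be scored against the trust vector in the same way.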

Provider:
figshare
Created:
2025-11-17