SocialMaze

ModelScope Community · Updated 2025-12-04 · Indexed 2025-05-17
Download link:
https://modelscope.cn/datasets/AI-ModelScope/SocialMaze
Resource description:
> ⚠️ **Notice:** This dataset is no longer maintained under this repository. It has been officially migrated to the MBZUAI organization for ongoing development and updates.
> 👉 Access the latest version here: [MBZUAI/SocialMaze](https://huggingface.co/datasets/MBZUAI/SocialMaze)

# SocialMaze Benchmark

This dataset is a component of the **SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models** project. It specifically features the **Hidden Role Deduction** task, which we consider one of the most challenging scenarios for testing complex social reasoning, deception handling, and inferential capabilities in Large Language Models (LLMs).

We have curated and formatted this task into a convenient question-answering (QA) structure to facilitate direct evaluation of LLMs.

For the complete SocialMaze benchmark, including all tasks, evaluation methodologies, and data generation code, please visit our main repository: [https://github.com/xzx34/SocialMaze](https://github.com/xzx34/SocialMaze).

## Dataset Structure

Each instance in the dataset represents a unique game scenario presented in a QA format.

### Data Fields

Each data point consists of the following fields:

* `system_prompt` (string): The system prompt to provide context, rules of the game, and instructions to the LLM.
* `prompt` (string): The user prompt detailing the game's progress, including all player statements across the rounds, and posing the two key questions (identify the Criminal and Player 1's true role).
* `answer` (string): The ground truth answer, specifying the true Criminal and Player 1's actual role.
* `reasoning_process` (string): An algorithmically generated, step-by-step reasoning chain that logically derives the correct `answer`. This "gold standard" thought process is highly pruned and demonstrates one path to the unique solution. It's invaluable for:
    * Human verification of the problem's solvability and the answer's correctness.
    * Serving as high-quality examples for few-shot learning or fine-tuning models on complex social reasoning.
* `round 1` (string): A consolidated string of all player statements made during round 1.
* `round 2` (string): A consolidated string of all player statements made during round 2.
* `round 3` (string): A consolidated string of all player statements made during round 3.

### Data Splits & Player 1 Role Distribution

The dataset is divided into two configurations based on difficulty:

* **`easy` split:** Contains scenarios with 6 players (3 Investigators, 1 Criminal, 1 Rumormonger, 1 Lunatic).
* **`hard` split:** Contains scenarios with 10 players (5 Investigators, 1 Criminal, 2 Rumormongers, 2 Lunatics).

Across the dataset, the distribution of **Player 1's true role** is approximately:

* Investigator: 3%
* Criminal: 2%
* Rumormonger: 60%
* Lunatic: 35%

## How to Use

Models can be directly evaluated by providing them with the `system_prompt` and `prompt` fields. The generated response can then be compared against the `answer` field to assess performance on both identifying the criminal and Player 1's role.

The `reasoning_process` field can be used for:

* Error analysis to understand where a model's reasoning deviates.
* Developing chain-of-thought or step-by-step prompting strategies.
* As training data for fine-tuning models to improve their deductive social reasoning capabilities.

## Task Description: Hidden Role Deduction

The Hidden Role Deduction task is designed to test an LLM's ability to infer hidden information and true identities in a multi-agent scenario fraught with potential deception and misinformation. The game involves several players with distinct roles and behavioral rules:

* **Investigators:** Always tell the truth.
* **Criminal:** Can choose to lie or tell the truth. Their objective is to remain undetected.
* **Rumormongers:** Believe they are Investigators and are shown the 'Investigator' role. However, their statements about other players are randomly true or false, making them unreliable sources of information.
* **Lunatics:** Believe they are the Criminal and are shown the 'Criminal' role. Similar to Rumormongers, their statements about others are randomly true or false, adding another layer of confusion.

**Gameplay & Objective:**

The game involves `n` players, and the LLM always takes on the perspective of **Player 1**. This means the LLM only has access to Player 1's observations and any information Player 1 would typically possess (including their *perceived* role, which might not be their *true* role if Player 1 is a Rumormonger or Lunatic).

The game proceeds for `T` rounds (typically 3 in this dataset). In each round:

1. Every player (including Player 1) selects another player.
2. Every player makes a public statement about their selected target (e.g., "Player A claims Player B is the Criminal," or "Player C claims Player D is not the Criminal").

After all rounds of statements are presented, the LLM is tasked with two key objectives:

1. **Identify the true Criminal** among all players.
2. **Infer Player 1's own true role** in the game (which could be Investigator, Criminal, Rumormonger, or Lunatic).

The core challenge lies in disentangling truthful statements from lies or random inaccuracies, especially when players might be mistaken about their own identities. This requires sophisticated reasoning about others' potential knowledge, intentions, and consistency.
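
To make the role rules above concrete, here is a minimal brute-force sketch in Python. It is not the benchmark's own solver or the algorithm behind `reasoning_process`. It only encodes the constraints stated above: every Investigator statement must be true, and Player 1's shown role restricts their true role. The `(speaker, target, says_criminal)` tuple encoding, the function names, and the 4-player toy game are assumptions made for illustration; the dataset itself stores statements as free text.

```python
from itertools import combinations


def consistent_worlds(n_players, n_investigators, statements, p1_shown_role):
    """Yield (criminal, investigators) pairs that satisfy the stated rules.

    statements: iterable of (speaker, target, says_criminal) tuples, a
    hypothetical encoding of "Player A claims Player B is (not) the Criminal".
    Statements by the Criminal, Rumormongers, and Lunatics are unconstrained,
    so only Investigator statements are checked.
    """
    players = range(1, n_players + 1)
    for criminal in players:
        # A player shown 'Investigator' is an Investigator or a Rumormonger.
        if p1_shown_role == "Investigator" and criminal == 1:
            continue
        others = [p for p in players if p != criminal]
        for group in combinations(others, n_investigators):
            investigators = set(group)
            # A player shown 'Criminal' is the Criminal or a Lunatic.
            if p1_shown_role == "Criminal" and 1 in investigators:
                continue
            if all(
                (target == criminal) == says_criminal
                for speaker, target, says_criminal in statements
                if speaker in investigators
            ):
                yield criminal, investigators


def player1_role(criminal, investigators, p1_shown_role):
    """Derive Player 1's true role in a given world."""
    if criminal == 1:
        return "Criminal"
    if 1 in investigators:
        return "Investigator"
    # Rumormongers are shown 'Investigator'; Lunatics are shown 'Criminal'.
    return "Rumormonger" if p1_shown_role == "Investigator" else "Lunatic"


# Toy game (not a dataset configuration): 4 players, 2 Investigators.
statements = [
    (1, 2, True),  # Player 1 claims Player 2 is the Criminal
    (2, 3, True),  # Player 2 claims Player 3 is the Criminal
    (3, 4, True),  # Player 3 claims Player 4 is the Criminal
    (4, 3, True),  # Player 4 claims Player 3 is the Criminal
]
worlds = list(consistent_worlds(4, 2, statements, "Investigator"))
roles = [player1_role(c, inv, "Investigator") for c, inv in worlds]
```

In this toy game only one world survives: Player 3 is the Criminal, Players 2 and 4 are the Investigators, and Player 1 turns out to be a Rumormonger despite being shown the 'Investigator' role. Exhaustive enumeration is workable at the 6- and 10-player sizes described above, but it says nothing about how an LLM should reason through the same puzzle.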
## Citation

If you use this dataset or the SocialMaze benchmark in your research, please cite our work:

```bibtex
@article{xu2025socialmaze,
  title={SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models},
  author={Xu, Zixiang and Wang, Yanbo and Huang, Yue and Ye, Jiayi and Zhuang, Haomin and Song, Zirui and Gao, Lang and Wang, Chenxi and Chen, Zhaorun and Zhou, Yujun and Li, Sixian and Pan, Wang and Zhao, Yue and Zhao, Jieyu and Zhang, Xiangliang and Chen, Xiuying},
  year={2025},
  note={Under review}
}
```

Provider:
maas
Created:
2025-05-18