declare-lab/KAIROS_EVAL
Source: Hugging Face | Updated: 2025-08-31 | Indexed: 2025-09-13
Download link:
https://hf-mirror.com/datasets/declare-lab/KAIROS_EVAL
Overview:
KAIROS_EVAL is a benchmark dataset for evaluating the robustness of large language models (LLMs) in multi-agent social interaction scenarios. The evaluation setting is built dynamically for each model: the benchmark first captures the model's original belief (answer plus confidence), then simulates peer influence through artificial agents of varying reliability. The dataset supports multiple-choice QA, robustness evaluation, and utility and resistance analysis. It covers four domains (reasoning, knowledge, common sense, and creativity) and contains 10,000 training instances and 3,000 test instances.
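The description mentions capturing an original belief and then exposing the model to peers of varying reliability. The Python sketch below illustrates that idea only; it is not the benchmark's code. The names `Belief`, `simulate_peers`, `classify_transition`, and `resistance_and_utility` are hypothetical. The metric definitions are also an assumption: resistance is read here as the share of originally correct answers that are kept, and utility as the share of originally wrong answers that are corrected.

```python
import random
from dataclasses import dataclass


@dataclass
class Belief:
    """A model's original belief: its answer plus a confidence in [0, 1]."""
    answer: str
    confidence: float

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


def simulate_peers(correct, options, n_peers, reliability, rng):
    """Hypothetical peer model: each peer gives the correct answer with
    probability `reliability`, otherwise a random wrong option."""
    wrong = [o for o in options if o != correct]
    return [
        correct if rng.random() < reliability else rng.choice(wrong)
        for _ in range(n_peers)
    ]


def classify_transition(original, final_answer, correct):
    """Label how the answer changed after seeing the peers."""
    was_right = original.answer == correct
    is_right = final_answer == correct
    if was_right:
        return "kept_correct" if is_right else "lost_correct"
    return "gained_correct" if is_right else "stayed_wrong"


def resistance_and_utility(records):
    """records: iterable of (original Belief, final answer, correct answer).

    Assumed definitions (not taken from the dataset card):
      resistance = kept_correct / originally correct
      utility    = gained_correct / originally wrong
    Returns None for a metric whose denominator is zero.
    """
    labels = [classify_transition(o, f, c) for o, f, c in records]
    kept, lost = labels.count("kept_correct"), labels.count("lost_correct")
    gained, stayed = labels.count("gained_correct"), labels.count("stayed_wrong")
    resistance = kept / (kept + lost) if kept + lost else None
    utility = gained / (gained + stayed) if gained + stayed else None
    return resistance, utility
```

In a real evaluation the final answer would come from re-querying the model with the simulated peer responses in its context; here it is simply supplied by the caller.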
Provider:
declare-lab



