maga666/reflexbench
Source: Hugging Face | Updated: 2026-04-24 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/maga666/reflexbench
Description:
ReflexBench is the first benchmark designed to evaluate reflexive reasoning in large language models: the capacity to reason about one's own causal impact on the environment being analyzed. The dataset consists of 20 scenarios across 6 domains (Financial Markets, Policy & Governance, Social Technology, Healthcare, Autonomous Systems, Education & Labor). Each scenario probes 4 levels of Observer Depth (OD-0 to OD-n), for a total of 80 evaluation points (20 scenarios × 4 levels). Scoring follows a two-stage protocol (automated pre-scoring followed by human calibration), and the dataset presents results for 9 LLMs across the OD levels.
Provided by:
maga666
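
The evaluation grid described above (20 scenarios, each probed at 4 Observer Depth levels) can be sketched as a simple cross product. This is only an illustration of the counting, not the dataset's actual schema: the scenario identifiers (`S01` to `S20`) and the intermediate level labels (`OD-1`, `OD-2`) are assumptions, since the listing gives only the counts and the range "OD-0 to OD-n".

```python
from itertools import product

# Hypothetical scenario identifiers; the listing states only that there are 20.
scenario_ids = [f"S{i:02d}" for i in range(1, 21)]

# Assumed level labels; the listing says only "4 levels, OD-0 to OD-n".
od_levels = ["OD-0", "OD-1", "OD-2", "OD-n"]

# One evaluation point per (scenario, observer-depth level) pair.
evaluation_points = list(product(scenario_ids, od_levels))

print(len(evaluation_points))  # 80
```

Each tuple in `evaluation_points` stands for one scored item, which is why the description arrives at 80 evaluation points.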



