supreme-lab/HALT_Benchmark_0.1_v1
Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/supreme-lab/HALT_Benchmark_0.1_v1
Description:
HALT Benchmark Dataset v1.0 evaluates bounded agentic decision-making under partial observability, constrained tools, and explicit escalation options. It is grounded in defensive cybersecurity workflows, where acting too early, failing to escalate, and over-escalating can each be costly. The benchmark contains 1,248 instances across four decision regimes, balanced at 312 instances per class and stratified across three difficulty levels and five workflow families. It is designed to assess whether language agents can judge when to stop, investigate, escalate, or refuse to act.
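The composition figures in the description can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it uses just the counts quoted above, treats each decision regime as one class, and assumes (this is not stated in the description) that the three difficulty levels and five workflow families are fully crossed into 15 strata. The constant and function names are hypothetical and do not come from the dataset.

```python
# Figures quoted in the dataset description.
TOTAL_INSTANCES = 1248
DECISION_REGIMES = 4
PER_CLASS = 312
DIFFICULTY_LEVELS = 3
WORKFLOW_FAMILIES = 5


def composition_summary():
    """Check the stated class balance and report the average stratum size."""
    # Four balanced classes of 312 should account for every instance.
    balanced = DECISION_REGIMES * PER_CLASS == TOTAL_INSTANCES
    # Assumption: difficulty and workflow family are fully crossed.
    strata = DIFFICULTY_LEVELS * WORKFLOW_FAMILIES
    return {
        "balanced": balanced,
        "strata": strata,
        "mean_per_stratum": TOTAL_INSTANCES / strata,
        "evenly_divisible": TOTAL_INSTANCES % strata == 0,
    }


print(composition_summary())
```

Under that assumption the class totals add up exactly, but 1,248 does not divide evenly into 15 strata, so the individual difficulty-by-family cells cannot all be the same size. The actual per-stratum counts would need to be read from the dataset itself.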
Provided by:
supreme-lab



