sammydman/KnowDoBench

Name: sammydman/KnowDoBench
Creator: sammydman
Published: 2026-04-22 23:54:28
License: no description available

Source: Hugging Face. Updated: 2026-04-22. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/sammydman/KnowDoBench

Description:

KnowDoBench is a physician-validated dataset for evaluating whether LLMs correctly answer or correctly refuse clinical tasks. Each case has deterministic ground truth: the model must either produce a correct numerical answer or abstain. The dataset is designed so that correct behavior requires both recognizing when a task is invalid and acting on that recognition. No subjective grading or LLM-based evaluation is required. KnowDoBench consists of 217 cases across four tracks: solvable, epistemic, normative, and normative_control, each with specific expected behaviors. The dataset features bidirectional evaluation, penalizing both over-answering and over-refusal. All base scenarios were authored de novo and independently validated by two board-certified physicians (Internal Medicine/Informatics; Emergency Medicine/Ethics). The dataset also supports structured failure visibility through track and tag labels, enabling stratified analysis of when and how models fail.
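The bidirectional, deterministic scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's official scorer: the `Case` and `Response` classes, the field names, the use of `None` to encode "abstain", the numeric tolerance, and the outcome labels are all assumptions introduced here, and the actual column names in the dataset may differ. Which tracks expect an answer and which expect abstention is left to the data rather than hard-coded, since the description does not spell out that mapping.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Case:
    # Hypothetical schema: the real dataset columns may be named differently.
    track: str                       # e.g. "solvable", "epistemic", ...
    expected_value: Optional[float]  # None encodes "correct behavior is to abstain"


@dataclass
class Response:
    value: Optional[float]           # None encodes "the model abstained"


def grade(case: Case, response: Response, rel_tol: float = 1e-6) -> str:
    """Return a deterministic outcome label; no LLM judge is involved."""
    if case.expected_value is None:
        # The task is invalid: any numeric answer counts as over-answering.
        return "correct_abstain" if response.value is None else "over_answer"
    if response.value is None:
        # The task is valid: abstaining counts as over-refusal.
        return "over_refusal"
    if math.isclose(response.value, case.expected_value, rel_tol=rel_tol):
        return "correct_answer"
    return "wrong_answer"


def stratify(cases, responses):
    """Count outcome labels per track for stratified failure analysis."""
    counts = defaultdict(lambda: defaultdict(int))
    for case, response in zip(cases, responses):
        counts[case.track][grade(case, response)] += 1
    return {track: dict(outcomes) for track, outcomes in counts.items()}
```

Because both `over_answer` and `over_refusal` are counted as failures, a model cannot improve its score by always answering or by always refusing. Grouping the labels by track (and, analogously, by tag) gives the stratified view of when and how a model fails.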

Provided by:

sammydman