sammydman/KnowDoBench

Hugging Face | Updated 2026-04-22 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/sammydman/KnowDoBench
Description:
KnowDoBench is a physician-validated dataset for evaluating whether LLMs correctly answer or correctly refuse clinical tasks. Each case has deterministic ground truth: the model must either produce a correct numerical answer or abstain. The dataset is designed so that correct behavior requires both recognizing when a task is invalid and acting on that recognition. No subjective grading or LLM-based evaluation is required. KnowDoBench consists of 217 cases across four tracks: solvable, epistemic, normative, and normative_control, each with specific expected behaviors. The dataset features bidirectional evaluation, penalizing both over-answering and over-refusal. All base scenarios were authored de novo and independently validated by two board-certified physicians (Internal Medicine/Informatics; Emergency Medicine/Ethics). The dataset also supports structured failure visibility through track and tag labels, enabling stratified analysis of when and how models fail.
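
The bidirectional, deterministic scoring described above can be sketched roughly as follows. This is a minimal illustration, not the dataset's official scorer: the mapping from track to expected behavior, the `Case` fields, the outcome labels, and the 1% numeric tolerance are all assumptions made for the example, since the description only states that each track has a specific expected behavior.

```python
import math
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: illustrative mapping only. The source says each track has a
# specific expected behavior but does not spell out which one.
EXPECTED_ACTION = {
    "solvable": "answer",
    "epistemic": "abstain",
    "normative": "abstain",
    "normative_control": "answer",
}


@dataclass
class Case:
    """Hypothetical case record; field names are not the dataset's schema."""
    track: str
    truth: Optional[float]  # None when the expected behavior is to abstain


def score(case: Case, response: Optional[float], rel_tol: float = 0.01) -> str:
    """Score one model response; `response` is None when the model abstains.

    Both failure directions are labeled: answering a case that should be
    refused ("over_answer") and refusing a case that should be answered
    ("over_refusal"). The tolerance is a hypothetical choice.
    """
    if EXPECTED_ACTION[case.track] == "abstain":
        return "correct_abstain" if response is None else "over_answer"
    if response is None:
        return "over_refusal"
    if math.isclose(response, case.truth, rel_tol=rel_tol):
        return "correct_answer"
    return "wrong_answer"


if __name__ == "__main__":
    print(score(Case("solvable", 12.5), 12.5))
    print(score(Case("solvable", 12.5), None))
    print(score(Case("epistemic", None), 3.0))
```

Because every outcome is decided by a comparison or a null check, no subjective or LLM-based grading is involved, and the returned labels can be grouped by track or tag for stratified failure analysis.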
Provider:
sammydman