AI4HealthResearch/MedMisBench
Source: Hugging Face | Updated: 2026-04-21 | Indexed: 2026-04-26
Download link:
https://hf-mirror.com/datasets/AI4HealthResearch/MedMisBench
Description:
MedMisBench is a benchmark for evaluating whether large language models keep the correct medical judgment when misleading medical context is introduced into a task. It is built from five medical question-answering sources spanning standard medical reasoning, expert reasoning, patient-journey scenarios, and agentic biomedical capability. Each item contains a source multiple-choice question, the correct answer, and structured misleading injections aligned to the answer options. The misleading context is organized along two axes: injection_content (five content-corruption types) and injection_provenance (three provenance framings). The released benchmark contains 10,932 multiple-choice items across five components: MEDMISQA (3,112 items), MEDMISMCQA (3,986 items), MEDMISXPERTQA (1,544 items), MEDMISJOURNEY (2,197 items), and MEDMISHLE (93 items). It is intended primarily for evaluation, including multiple-choice medical question answering under misleading context and robustness evaluation of medical and health-adjacent LLMs. Most items are in English, with a subset in Chinese.
Provided by:
AI4HealthResearch
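
The item structure given in the description (a multiple-choice question, its correct answer, and injections tagged along the injection_content and injection_provenance axes) can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names, every field name other than injection_content and injection_provenance, the prompt layout, and the toy question are hypothetical and are not taken from the released files. The component sizes are the ones listed in the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Component sizes as stated in the dataset description.
COMPONENT_SIZES = {
    "MEDMISQA": 3112,
    "MEDMISMCQA": 3986,
    "MEDMISXPERTQA": 1544,
    "MEDMISJOURNEY": 2197,
    "MEDMISHLE": 93,
}


@dataclass
class Injection:
    # Hypothetical layout: only the two axis names come from the description.
    target_option: str         # answer option the misleading text is aligned to
    injection_content: str     # one of five content-corruption types
    injection_provenance: str  # one of three provenance framings
    text: str                  # the misleading context itself


@dataclass
class Item:
    # Hypothetical field names for a single benchmark item.
    question: str
    options: Dict[str, str]
    answer: str
    injections: List[Injection] = field(default_factory=list)


def build_prompt(item: Item, injection: Optional[Injection] = None) -> str:
    """Render an item as a prompt, optionally prefixed by misleading context."""
    lines = []
    if injection is not None:
        lines.append(f"Context: {injection.text}")
    lines.append(f"Question: {item.question}")
    for label in sorted(item.options):
        lines.append(f"{label}. {item.options[label]}")
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)


def accuracy(items: List[Item], predictions: List[str]) -> float:
    """Fraction of items whose predicted option letter matches the answer."""
    if not items:
        return 0.0
    correct = sum(
        1
        for item, pred in zip(items, predictions)
        if pred.strip().upper() == item.answer.upper()
    )
    return correct / len(items)


# Toy item, invented for illustration; it is not drawn from the benchmark.
toy = Item(
    question="Deficiency of which vitamin causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin C"},
    answer="B",
    injections=[
        Injection(
            target_option="A",
            injection_content="placeholder_content_type",
            injection_provenance="placeholder_provenance",
            text="A recent note claims scurvy results from vitamin A deficiency.",
        )
    ],
)

clean_prompt = build_prompt(toy)
misled_prompt = build_prompt(toy, toy.injections[0])
total_items = sum(COMPONENT_SIZES.values())
```

Comparing `accuracy` on predictions collected from clean prompts against predictions collected from prompts with an injection gives one simple way to express robustness to misleading context. How the benchmark itself scores models is not stated in the description.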



