MIMIC-IV-Ext-MedicalBench: Evaluating Large Language Models Towards Improved Medical Concept Extraction
Source: DataCite Commons. Updated: 2026-03-24. Indexed: 2026-05-04.
Download link:
https://physionet.org/content/mimic-iv-ext-medicalbench/1.0.0/
Description:
Medical concept extraction from electronic health records underpins many
downstream applications, yet remains challenging because medically meaningful
concepts, such as diagnoses, are frequently implied rather than explicitly
stated in medical narratives. Existing benchmarks with human-annotated
evidence spans underscore the importance of grounding extracted concepts in
medical text. However, they predominantly focus on explicitly stated concepts
and provide limited coverage of cases in which medically relevant concepts
must be inferred. We present MedicalBench, a new benchmark for medical concept
extraction with evidence grounding that evaluates implicit medical reasoning.
MedicalBench formulates concept extraction as a verification task over medical
note-concept pairs, coupled with sentence-level evidence identification. Built
from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset
is curated through a multi-stage large language model (LLM) triage pipeline
followed by dual medical annotation and expert review. It deliberately
includes implicit positives, semantically confusable negatives, and cases
where LLM judgments disagree with human assessments. Annotators provide
sentence-level evidence spans and concise medical rationales. In total, the
dataset contains 405 high-quality examples, covering a broad range of ICD-10
chapters. By providing ground-truth evidence and confusable alternatives,
MedicalBench enables rigorous evaluation of not only _whether_ a model can
extract the correct concept, but also _why_ -- rewarding solutions that can
highlight relevant evidence and reject plausible-but-incorrect diagnoses and
procedures.
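
The task formulation described above (verifying a note-concept pair and identifying sentence-level evidence) can be illustrated with a minimal Python sketch. This is not the dataset's released schema or its official scoring: the `NoteConceptPair` class, its field names, the `evidence_f1` and `score` helpers, and the choice of sentence-level F1 as the evidence metric are all assumptions made for illustration, and the toy note is invented rather than taken from MIMIC-IV.

```python
from dataclasses import dataclass, field


@dataclass
class NoteConceptPair:
    """One hypothetical benchmark item. Field names are assumptions,
    not the schema of the released files."""
    sentences: list[str]   # note text split into sentences
    concept: str           # candidate ICD-10 code being verified
    label: bool            # True if the note supports the concept
    evidence: set[int] = field(default_factory=set)  # gold evidence sentence indices


def evidence_f1(predicted: set[int], gold: set[int]) -> float:
    """Sentence-level F1 between predicted and gold evidence indices."""
    if not predicted and not gold:
        return 1.0  # nothing to find and nothing claimed
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def score(item: NoteConceptPair, predicted_label: bool,
          predicted_evidence: set[int]) -> dict:
    """Score one prediction on both parts of the task:
    the verification decision and the evidence sentences."""
    return {
        "label_correct": predicted_label == item.label,
        "evidence_f1": evidence_f1(predicted_evidence, item.evidence),
    }


# Invented toy example of an implicit positive: the diagnosis is never
# named, but two sentences imply it.
item = NoteConceptPair(
    sentences=[
        "Patient was continued on home metformin.",
        "HbA1c was 9.2% on admission.",
        "Discharged home in stable condition.",
    ],
    concept="E11.9",
    label=True,
    evidence={0, 1},
)

# A model that gets the decision right but cites one wrong sentence.
result = score(item, predicted_label=True, predicted_evidence={1, 2})
print(result)
```

Scoring the decision and the evidence separately reflects the distinction the description draws between whether a model extracts the correct concept and why: in this sketch, a correct label with poorly chosen evidence still receives a low evidence score.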
Provider:
PhysioNet
Created:
2026-03-18



