CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
Source: DataCite Commons | Updated: 2025-10-16 | Indexed: 2026-05-04
Download link:
https://physionet.org/content/chexstruct-cxreasonbench/1.0.0/
Resource description:
Recent progress in Large Vision-Language Models (LVLMs) has enabled promising
applications in medical tasks such as report generation and visual question
answering. However, existing benchmarks focus mainly on the final diagnostic
answer, offering limited insight into whether models engage in clinically
meaningful reasoning.
To address this, we present CheXStruct and CXReasonBench, a structured
pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset.
CheXStruct automatically derives a sequence of intermediate reasoning steps
directly from chest X-rays, such as segmenting anatomical regions, deriving
anatomical landmarks and diagnostic measurements, computing diagnostic
indices, and applying clinical thresholds. CXReasonBench leverages this
pipeline to evaluate whether models can perform clinically valid reasoning
steps and to what extent they can learn from structured guidance, enabling
fine-grained and transparent assessment of diagnostic reasoning. The benchmark
comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each
paired with up to four visual inputs, and supports multi-path, multi-stage
evaluation including visual grounding via anatomical region selection and
diagnostic measurements.
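
The measurement, index, and threshold stages described above can be illustrated with a minimal sketch. It is not CheXStruct's implementation: the choice of the cardiothoracic ratio as the diagnostic index, the 0.5 cutoff (a common convention for PA radiographs), and every function name here are assumptions made for illustration. Segmentation is assumed to have already produced binary masks, represented as nested lists of booleans.

```python
def rect_mask(height, width, top, bottom, left, right):
    """Build a toy binary mask with a filled rectangle (hypothetical helper)."""
    return [
        [top <= y < bottom and left <= x < right for x in range(width)]
        for y in range(height)
    ]


def horizontal_extent(mask):
    """Return the leftmost and rightmost column indices that contain mask pixels."""
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not cols:
        raise ValueError("mask is empty")
    return min(cols), max(cols)


def cardiothoracic_ratio(heart_mask, thorax_mask):
    """Measurement and index steps: widest heart span divided by widest thorax span."""
    heart_left, heart_right = horizontal_extent(heart_mask)
    thorax_left, thorax_right = horizontal_extent(thorax_mask)
    heart_width = heart_right - heart_left + 1
    thorax_width = thorax_right - thorax_left + 1
    return heart_width / thorax_width


def classify(ctr, threshold=0.5):
    """Threshold step: the 0.5 cutoff is an assumed convention, not taken from the dataset."""
    return "enlarged" if ctr > threshold else "normal"


# Toy masks standing in for segmentation output.
thorax = rect_mask(10, 20, top=2, bottom=9, left=2, right=18)  # 16 columns wide
heart = rect_mask(10, 20, top=5, bottom=8, left=6, right=15)   # 9 columns wide

ctr = cardiothoracic_ratio(heart, thorax)
label = classify(ctr)
print(f"CTR = {ctr:.4f}, finding = {label}")
```

Keeping each stage in its own function mirrors the idea of checking intermediate steps separately, so a wrong final label can be traced to the measurement, the index, or the threshold.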
This dataset is intended to serve as a standardized resource for developing,
evaluating, and comparing vision-language models on clinically grounded
reasoning tasks in chest X-rays.
Provider:
PhysioNet
Created:
2025-10-14



