Table 3_Diagnostic performance of large language models on the NEJM image challenge: a comparative study with human evaluators and the impact of prompt engineering.docx
Source: NIAID Data Ecosystem. Indexed 2026-05-10.
Download link:
https://figshare.com/articles/dataset/Table_3_Diagnostic_performance_of_large_language_models_on_the_NEJM_image_challenge_a_comparative_study_with_human_evaluators_and_the_impact_of_prompt_engineering_docx/31200685
Description:
Introduction: Multimodal large language models (LLMs) that can interpret clinical text and images are emerging as potential decision-support tools, yet their accuracy on standardized cases, and how it compares with human performance across difficulty levels, remains largely unclear. This study aimed to rigorously evaluate the performance of four leading LLMs on the 200-item New England Journal of Medicine (NEJM) Image Challenge.
Methods: We assessed OpenAI o4-mini-high, Claude 4 Opus, Gemini 2.5 Pro, and Qwen 3, and benchmarked the top model against three medical students (Years 5–7) and an internal-medicine attending physician under identical test conditions. Additionally, we characterized the dominant error types for OpenAI o4-mini-high and tested prompt engineering strategies for potential correction.
Results: OpenAI o4-mini-high achieved the highest overall accuracy, 94%, and its performance remained consistently high across easy, moderate, and difficult cases. Human accuracy in this cohort ranged from 38.5% for the three medical students to 70.5% for the attending physician, all significantly lower than that of OpenAI o4-mini-high. An analysis of OpenAI o4-mini-high’s 12 errors revealed that most (83.3%) reflected lapses in diagnostic logic rather than in input processing. Notably, simple prompting techniques such as chain-of-thought and few-shot learning corrected over half of these initial errors.
Conclusion: Within the context of this standardized challenge, a leading multimodal LLM delivered high diagnostic accuracy that surpassed the scores of both peer models and the recruited human participants. However, these results should be interpreted as evidence of pattern recognition capabilities rather than human-like clinical understanding. While further validation on real-world data is warranted, these findings support the potential utility of LLMs in educational and standardized settings. Most residual errors stem from logic gaps that refined user prompting can partly mitigate, which underscores the importance of human-AI interaction for maximizing reliability.
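The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not part of the dataset: the function name `error_breakdown` is hypothetical, and it assumes that the 94% accuracy refers to all 200 items and that the 83.3% share refers to the resulting error count.

```python
# Figures quoted in the abstract.
TOTAL_ITEMS = 200      # NEJM Image Challenge items
TOP_ACCURACY = 0.94    # OpenAI o4-mini-high overall accuracy
LOGIC_SHARE = 0.833    # share of errors attributed to diagnostic logic


def error_breakdown(total_items, accuracy, logic_share):
    """Return (errors, logic_errors, other_errors) as whole item counts.

    Hypothetical helper: assumes the accuracy applies to every item and
    that logic_share is a fraction of the resulting errors.
    """
    errors = round(total_items * (1 - accuracy))
    logic_errors = round(errors * logic_share)
    return errors, logic_errors, errors - logic_errors


errors, logic_errors, other_errors = error_breakdown(
    TOTAL_ITEMS, TOP_ACCURACY, LOGIC_SHARE
)
print(errors, logic_errors, other_errors)
```

Under these assumptions the numbers agree with the abstract: 12 errors in total, of which 10 are logic lapses and 2 relate to input processing.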
Created:
2026-01-30



