Table 1_LLM evaluation for thyroid nodule assessment: comparing ACR-TIRADS, C-TIRADS, and clinician-AI trust gap.xlsx
Source: NIAID Data Ecosystem. Indexed: 2026-05-10
Download link:
https://figshare.com/articles/dataset/Table_1_LLM_evaluation_for_thyroid_nodule_assessment_comparing_ACR-TIRADS_C-TIRADS_and_clinician-AI_trust_gap_xlsx/30229513
Description:
Objective: To evaluate the diagnostic performance and clinical utility of three advanced large language models (LLMs), GPT-4o, GPT-o3-mini, and DeepSeek-R1, in stratifying thyroid nodule malignancy risk and generating guideline-aligned management recommendations from structured narrative ultrasound descriptions.
Methods: This diagnostic modeling study evaluated three LLMs (GPT-4o, GPT-o3-mini, and DeepSeek-R1) using standardized narrative ultrasound descriptors. The descriptors were annotated by consensus among three senior board-certified sonologists and were processed independently in a stateless manner to avoid bias in the outputs. LLM outputs were assessed under both the ACR-TIRADS and C-TIRADS frameworks. Two experienced clinicians (a thyroid surgeon and an endocrinologist) independently rated the outputs across five clinical dimensions using 5-point Likert scales. Primary outcomes were the area under the receiver operating characteristic curve (AUC) for malignancy prediction and the clinician ratings of guideline adherence, patient safety, operational feasibility, clinical applicability, and overall performance.
Results: GPT-4o achieved the highest predictive AUC (0.898) under C-TIRADS, approaching expert-level accuracy. DeepSeek-R1, particularly with C-TIRADS, received the highest clinician ratings (mean Likert: surgeon 4.65, endocrinologist 4.63), reflecting greater trust in its practical recommendations. Clinicians consistently favored the C-TIRADS framework across all models. GPT-4o and GPT-o3-mini received lower ratings in trustworthiness and recommendation quality, especially from the endocrinologist.
Conclusion: While GPT-4o demonstrated superior diagnostic accuracy, clinicians most trusted DeepSeek-R1 combined with the C-TIRADS framework for generating practical, guideline-consistent recommendations. The findings highlight the critical need for alignment between AI-generated outputs and clinician expectations, and the importance of incorporating region-specific clinical guidelines (such as C-TIRADS) for the effective real-world implementation of LLMs in thyroid nodule management decision support.
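As an illustration of the AUC outcome named in the Methods, the following is a minimal sketch of how an AUC can be computed when a model assigns each nodule an ordinal risk category. It is not the authors' analysis code. The function name `auc_from_scores` and the toy category and label values are hypothetical and are not taken from the dataset. The sketch uses the rank-based (Mann-Whitney) definition of AUC: the probability that a randomly chosen malignant nodule receives a higher category than a randomly chosen benign one, with ties counted as one half.

```python
from typing import Sequence


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: P(score of malignant > score of benign), ties count 0.5.

    `labels` uses 1 for malignant and 0 for benign.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")

    wins = 0.0
    # Compare every malignant case against every benign case.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): model-assigned TI-RADS
# categories and the reference diagnosis for eight nodules.
tirads_category = [2, 3, 3, 4, 4, 5, 5, 5]
malignant = [0, 0, 0, 0, 1, 0, 1, 1]

auc = auc_from_scores(tirads_category, malignant)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of cases, which is acceptable for small evaluation sets. For larger data, a library routine such as scikit-learn's `roc_auc_score` computes the same quantity.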
Created:
2025-09-29



