Data Sheet 1_Evaluating Chain-of-Thought reasoning in large language models for thyroid ultrasound interpretation: a dual-information approach.docx

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://figshare.com/articles/dataset/Data_Sheet_1_Evaluating_Chain-of-Thought_reasoning_in_large_language_models_for_thyroid_ultrasound_interpretation_a_dual-information_approach_docx/31832179

Description:

Objective: To assess whether reasoning-capable large language models (LLMs) can accurately interpret both qualitative and quantitatively encoded ultrasound features of thyroid nodules within the ACR-TIRADS framework and improve diagnostic reliability.

Methods: This retrospective study analyzed thyroid nodules with both radiologist-labeled qualitative ultrasound features and quantitatively encoded descriptors generated through standardized numerical modeling. Both formats were converted into structured prompts and input separately into four CoT-enabled LLMs (ChatGPT-O3, Grok-3, DeepSeek-R1, Gemini-2.5 Pro), each performing three reasoning rounds per task. Diagnostic performance was evaluated by accuracy and reproducibility, and two types of inconsistencies (cross-threshold and cross-modal conflicts) were quantified. Reasoning authenticity and conciseness were independently assessed by radiologists of varying experience. Sankey diagrams were used to summarize ACR-TIRADS category transitions.

Results: ChatGPT-O3, Gemini-2.5 Pro, and Grok-3 showed strong ACR-TIRADS accuracy (91%, 96%, and 96%, respectively), outperforming DeepSeek-R1 (79%). Grok-3 was highest in score-based accuracy (96%); DeepSeek-R1 was lowest (52%). Reproducibility for categorization was 93% for Grok-3, 90% for Gemini-2.5 Pro, and 88% for ChatGPT-O3, vs. 67% for DeepSeek-R1. For scoring reproducibility, Grok-3 (93%), ChatGPT-O3 (90%), and Gemini-2.5 Pro (79%) exceeded DeepSeek-R1 (18%). Physicians rated Grok-3 and Gemini-2.5 Pro highest in reasoning authenticity, while ChatGPT-O3 was most concise (mean 144 words). For quantitative tasks, Gemini-2.5 Pro (78%) and DeepSeek-R1 (74%) were most accurate; Grok-3 was lowest (64%). Reproducibility was highest for Gemini-2.5 Pro (84%) and DeepSeek-R1 (86%). Across models, the proportion of nodules exhibiting cross-threshold discrepancies ranged from 3% to 17%, with Grok-3 lowest and DeepSeek-R1 highest. Cross-modal conflicts were more frequent, ranging from 27% to 36% across the four LLMs.

Conclusion: Grok-3 excelled in qualitative tasks, while Gemini-2.5 Pro and DeepSeek-R1 showed strengths in quantitative analysis. CoT-enabled LLMs offered interpretable reasoning with promise for clinical decision support.
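The abstract reports accuracy and reproducibility over three reasoning rounds per task but does not state the exact formulas. As a minimal sketch only, the Python below assumes that accuracy is the share of all per-round outputs matching the reference label, and that reproducibility is the share of nodules whose output is identical in every round. The function names, nodule IDs, and category labels are hypothetical toy values, not data from the study.

```python
def accuracy(rounds, reference):
    """Fraction of all (nodule, round) outputs that match the reference label.

    Assumed definition; the source abstract does not specify one.
    """
    total = sum(len(outputs) for outputs in rounds.values())
    correct = sum(
        output == reference[nodule]
        for nodule, outputs in rounds.items()
        for output in outputs
    )
    return correct / total


def reproducibility(rounds):
    """Fraction of nodules whose outputs are identical across every round.

    Assumed definition; the source abstract does not specify one.
    """
    stable = sum(len(set(outputs)) == 1 for outputs in rounds.values())
    return stable / len(rounds)


# Hypothetical toy data: three rounds of ACR-TIRADS categories per nodule.
rounds = {
    "n1": ["TR4", "TR4", "TR4"],
    "n2": ["TR3", "TR4", "TR3"],
    "n3": ["TR5", "TR5", "TR5"],
    "n4": ["TR2", "TR2", "TR2"],
}
reference = {"n1": "TR4", "n2": "TR3", "n3": "TR5", "n4": "TR3"}

print(f"accuracy = {accuracy(rounds, reference):.2f}")
print(f"reproducibility = {reproducibility(rounds):.2f}")
```

In this toy set, nodule n4 is answered consistently but incorrectly, which illustrates why the two metrics can diverge, as they do for several models in the reported results.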

Created:

2026-03-23