Datasheet2_ChatGPT yields low accuracy in determining LI-RADS scores based on free-text and structured radiology reports in German language.docx
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/Datasheet2_ChatGPT_yields_low_accuracy_in_determining_LI-RADS_scores_based_on_free-text_and_structured_radiology_reports_in_German_language_docx/26182688
Resource description:
Background: To investigate the feasibility of using the large language model (LLM) ChatGPT to classify liver lesions according to the Liver Imaging Reporting and Data System (LI-RADS) based on MRI reports, and to compare classification performance on structured vs. unstructured reports.
Methods: LI-RADS-classifiable liver lesions were included from German-language structured and unstructured MRI reports; a report of lesion size, location, and arterial-phase contrast enhancement was the minimum inclusion requirement. The findings sections of the reports were passed to ChatGPT (GPT-3.5), which was instructed to determine a LI-RADS score for each classifiable liver lesion. Ground truth was established by two radiologists in consensus. Agreement between ground truth and ChatGPT was assessed with Cohen's kappa. Test-retest reliability was assessed with the intraclass correlation coefficient (ICC) by passing a subset of n = 50 lesions to ChatGPT five times.
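As a minimal sketch of the agreement measure named in the Methods, the following computes an unweighted Cohen's kappa between two label sequences. The abstract does not say whether a weighted variant was used, so the unweighted form is an assumption; the function name and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both sequences must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels (LI-RADS categories coded as integers), not study data.
ground_truth = [3, 4, 5, 3, 4, 5, 2, 3]
model_output = [3, 4, 4, 3, 5, 5, 2, 4]
kappa = cohens_kappa(ground_truth, model_output)
print(round(kappa, 3))
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it can sit well below the plain accuracy when a few categories dominate.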
Results: 205 MRIs from 150 patients were included. The accuracy of ChatGPT in determining LI-RADS categories was poor (53% on unstructured and 44% on structured reports). In free-text compared with structured reports, agreement with the ground truth was higher (κ = 0.51 vs. κ = 0.44), the mean absolute error in LI-RADS scores was lower (0.5 ± 0.5 vs. 0.6 ± 0.7, p < 0.05), and test-retest reliability was higher (ICC = 0.81 vs. 0.50), although structured reports contained the minimum required imaging features significantly more often (chi-square test, p < 0.05).
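The accuracy and mean absolute error reported in the Results can be illustrated with a short sketch. It assumes LI-RADS categories are coded as plain integers, which is a simplification made here and not something the abstract states; the function name and toy labels are hypothetical and are not data from the study.

```python
def accuracy_and_mae(truth, predicted):
    """Return (accuracy, mean absolute error) for integer-coded category labels."""
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("both sequences must be non-empty and of equal length")
    n = len(truth)
    # Accuracy: fraction of lesions assigned exactly the reference category.
    accuracy = sum(t == p for t, p in zip(truth, predicted)) / n
    # Mean absolute error: average distance in category steps from the reference.
    mae = sum(abs(t - p) for t, p in zip(truth, predicted)) / n
    return accuracy, mae


# Hypothetical toy labels, not study data.
reference = [3, 4, 5, 3, 4, 5, 2, 3]
assigned = [3, 4, 4, 3, 5, 5, 2, 4]
acc, mae = accuracy_and_mae(reference, assigned)
print(acc, mae)
```

The two numbers answer different questions: accuracy counts only exact matches, while the mean absolute error also reflects how far off the misclassified lesions are.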
Conclusions: ChatGPT attained only low accuracy when asked to determine LI-RADS scores from liver imaging reports. The superior accuracy and consistency on free-text reports might relate to ChatGPT's training process.
Clinical relevance statement: Our study indicates both the need to optimize LLMs for structured clinical data input and the potential of LLMs for creating machine-readable labels from large free-text radiological databases.
Created:
2024-07-05



