Table 1_Evaluation of large language models for PI-RADS score extraction from free-text prostate MRI reports: a comparative study with human readers.docx

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://figshare.com/articles/dataset/Table_1_Evaluation_of_large_language_models_for_PI-RADS_score_extraction_from_free-text_prostate_MRI_reports_a_comparative_study_with_human_readers_docx/31980114

Description:

Objective: This study aimed to evaluate the ability of GPT-4o and Gemini 2.5 Pro to extract and assign PI-RADS v2.1 scores from free-text prostate MRI reports, and to compare their performance with that of human readers of varied experience.

Methods: Three radiologists with differing levels of experience (resident, fellow, expert) independently reviewed the reports and assigned PI-RADS v2.1 scores. The same reports were processed through prompts with GPT-4o and Gemini 2.5 Pro. Inter-rater agreement was evaluated using Gwet’s AC1 coefficient, and diagnostic performance was assessed using sensitivity, specificity, and area under the receiver operating characteristic curve (AUC).

Results: Among human readers, inter-rater agreement was highest between the expert and the fellow (Gwet’s AC1 = 0.68, 95% CI 0.61-0.75), which was significantly higher than the agreement between the two LLMs (Gwet’s AC1 = 0.52, 95% CI 0.44-0.59, P = 0.004). The agreement between the expert and GPT (Gwet’s AC1 = 0.42, 95% CI 0.34-0.51) was lower than that between the expert and Gemini (Gwet’s AC1 = 0.49, 95% CI 0.41-0.57, P = 0.17). The AUCs for the resident, fellow, and expert readers were 0.81 (95% CI 0.76-0.87), 0.86 (95% CI 0.81-0.91), and 0.89 (95% CI 0.85-0.93), respectively, and those for GPT and Gemini were 0.85 (95% CI 0.81-0.90) and 0.84 (95% CI 0.80-0.89), respectively.

Conclusion: LLMs demonstrated promising performance in assigning PI-RADS scores from free-text prostate MRI reports, with accuracy and agreement approaching those of general radiologists; however, they are not yet ready to replace expert interpretation in high-stakes clinical settings. Nevertheless, these findings support their potential as a supplementary tool for report standardization and trainee education.
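For readers unfamiliar with the agreement statistic named above, the following is a minimal sketch of the standard unweighted Gwet's AC1 for two raters. It is not the authors' analysis code: the function name `gwet_ac1`, the toy score lists, and the choice of the unweighted two-rater form are all assumptions made for illustration, and the study may have used a weighted variant or dedicated statistical software.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories):
    """Unweighted Gwet's AC1 for two raters (illustrative sketch).

    ratings_a, ratings_b: equal-length sequences of category labels.
    categories: all possible labels (e.g. PI-RADS scores 1 to 5).
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")
    n = len(ratings_a)

    # Observed agreement: share of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Average marginal proportion of each category across the two raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {k: counts[k] / (2 * n) for k in categories}

    # Chance agreement under Gwet's model.
    p_e = sum(pi[k] * (1 - pi[k]) for k in categories) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Hypothetical toy scores, NOT data from the study.
reader_1 = [1, 1, 2, 2]
reader_2 = [1, 2, 2, 2]
ac1 = gwet_ac1(reader_1, reader_2, categories=[1, 2])
```

Because chance agreement is built from the average marginal proportions rather than their product, AC1 tends to be more stable than Cohen's kappa when one category dominates, which is a common reason for choosing it with skewed score distributions.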

Created:

2026-04-10
