
Supplementary file 1_Benchmarking GPT-5 in radiation oncology: measurable gains, but persistent need for expert oversight.pdf

Indexed by NIAID Data Ecosystem on 2026-05-10
Download link:
https://figshare.com/articles/dataset/Supplementary_file_1_Benchmarking_GPT-5_in_radiation_oncology_measurable_gains_but_persistent_need_for_expert_oversight_pdf/30856763
Resource description:
Introduction: Large language models (LLMs) have shown great potential in clinical decision support and medical education. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. This study comprehensively benchmarks GPT-5 for the field of radiation oncology.
Methods: Performance was assessed using two complementary benchmarks: (i) the American College of Radiology Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncology vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate structured therapeutic plans and concise two-line summaries. Four board-certified radiation oncologists independently rated outputs for correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss’ κ. GPT-5 results were compared to published GPT-3.5 and GPT-4 baselines.
Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in dose specification and diagnosis. In the vignette evaluation, GPT-5’s treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11–3.38) and comprehensiveness (3.59/4, 95% CI: 3.49–3.69). Hallucinations were rare, flagged in 10.0% of all individual reviewer assessments (24 of 240), and no patient case reached majority consensus for their presence. Inter-rater agreement was low (Fleiss’ κ 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation.
Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation. In addition, considerable inter-rater variability highlights the challenge of achieving consistent expert evaluation.
Created:
2025-12-11