Comparative evaluation of ChatGPT-4o and DeepSeek-V3 in head and neck oncology
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Comparative_evaluation_of_ChatGPT-4o_and_DeepSeek-V3_in_head_and_neck_oncology/30472568
Description:
Large language models (LLMs) are increasingly used in clinical decision-making and patient education, including in complex specialties such as head and neck cancer (HNC).
This study evaluated the performance of ChatGPT-4o and DeepSeek-V3 in answering HNC-related clinical questions.
A set of 154 questions across six clinical categories was submitted twice to both models. Responses were independently graded by head and neck surgeons using a four-point accuracy scale. Accuracy, reproducibility, and inter-model agreement were assessed.
ChatGPT-4o and DeepSeek-V3 provided “comprehensive/correct” answers in 92.2% and 89.6% of cases, respectively (p = .42). The two models received the same accuracy rating in 85.1% of cases; however, the statistical agreement between them remained low (Cohen’s κ = 0.12; ICC = 0.21, p = .006). DeepSeek-V3 outperformed ChatGPT-4o in the Treatment category (96.3% vs. 81.5%, p = .08), while ChatGPT-4o performed better in the Recovery, Complications, and Follow-up category (95.0% vs. 82.5%, p = .08); however, these differences did not reach statistical significance. Reproducibility was high for both models (ChatGPT-4o: 96.1%; DeepSeek-V3: 96.8%).
Both models demonstrated strong accuracy and consistency in HNC-related queries.
LLMs hold promise as reliable tools for clinical decision-making and patient education in HNC when used with careful consideration of their inherent limitations.
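The results above report a high raw overlap in ratings (85.1%) alongside a low Cohen's κ (0.12). The sketch below illustrates how both can hold at once: when nearly all answers fall into one category, most of the observed agreement is already expected by chance, so κ stays small. The 2x2 table is hypothetical (100 made-up items, ratings collapsed to correct/not correct) and is not the study's data; the function name `cohen_kappa` is likewise an illustrative choice.

```python
def cohen_kappa(table):
    """Return (observed agreement, chance agreement, kappa) for a square
    contingency table: rows are rater A's categories, columns rater B's."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of items on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of the marginal proportions, summed per category.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


# HYPOTHETICAL counts, not taken from the dataset.
# Rows: model A correct / not correct; columns: model B correct / not correct.
hypothetical = [
    [85, 7],
    [6, 2],
]

po, pe, kappa = cohen_kappa(hypothetical)
print(f"observed agreement = {po:.3f}")
print(f"chance agreement   = {pe:.3f}")
print(f"Cohen's kappa      = {kappa:.3f}")
```

With these made-up counts the models agree on 87% of items, yet about 84% agreement would be expected by chance alone, which leaves κ near 0.16. The study used a four-point scale, so its actual computation would involve a 4x4 table rather than this binary simplification.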
Created:
2025-10-29



