Comparative evaluation of ChatGPT-4o and DeepSeek-V3 in head and neck oncology

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://figshare.com/articles/dataset/Comparative_evaluation_of_ChatGPT-4o_and_DeepSeek-V3_in_head_and_neck_oncology/30472568

Description:

Large language models (LLMs) are increasingly used in clinical decision-making and patient education, including in complex specialties such as head and neck cancer (HNC). This study evaluated the performance of ChatGPT-4o and DeepSeek-V3 in answering HNC-related clinical questions. A set of 154 questions across six clinical categories was submitted twice to both models. Responses were independently graded by head and neck surgeons using a four-point accuracy scale. Accuracy, reproducibility, and inter-model agreement were assessed. ChatGPT-4o and DeepSeek-V3 provided "comprehensive/correct" answers in 92.2% and 89.6% of cases, respectively (p = .42). The two models received the same accuracy rating in 85.1% of cases; however, the statistical agreement between them remained low (Cohen's κ = 0.12; ICC = 0.21, p = .006). DeepSeek-V3 outperformed ChatGPT-4o in the Treatment category (96.3% vs. 81.5%, p = .08), while ChatGPT-4o scored higher in the Recovery, Complications, and Follow-up category (95.0% vs. 82.5%, p = .08); neither difference reached statistical significance. Reproducibility was high for both models (ChatGPT-4o: 96.1%; DeepSeek-V3: 96.8%). Both models demonstrated strong accuracy and consistency on HNC-related queries. LLMs hold promise as reliable tools for clinical decision-making and patient education in HNC when used with careful consideration of their inherent limitations.
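
The combination of high rating overlap and low Cohen's κ reported above can look contradictory. The following minimal sketch shows how it can arise when nearly all ratings fall into one category, since chance agreement is then already high. The counts in `table` are hypothetical and chosen only for illustration; they are not the study's data, and the function name `cohen_kappa` is an assumption of this example rather than anything from the dataset.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table.

    table[i][j] = number of items rated category i for model A
    and category j for model B.
    Returns (observed agreement, chance-expected agreement, kappa).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of items on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Chance agreement: product of the two models' marginal proportions.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa


# HYPOTHETICAL counts for 100 questions (not the study's data):
# rows = model A (correct, not correct), columns = model B.
table = [[85, 7],
         [7, 1]]

observed, expected, kappa = cohen_kappa(table)
print(f"observed={observed:.3f} expected={expected:.3f} kappa={kappa:.3f}")
```

In this hypothetical table the models agree on 86% of items, yet chance alone would produce about 85% agreement, so κ comes out near 0.05. The study's own κ and ICC were computed on its four-point scale and may have used a different procedure.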

Created:

2025-10-29