Table 2_Evaluation of the accuracy of large language models in answering bone cancer-related questions.xlsx

Indexed by NIAID Data Ecosystem on 2026-05-10

Download link:

https://figshare.com/articles/dataset/Table_2_Evaluation_of_the_accuracy_of_large_language_models_in_answering_bone_cancer-related_questions_xlsx/30797480

Description:

Introduction: Large Language Models (LLMs) excel at understanding medical terminology, parsing unstructured clinical data, and generating contextually relevant insights, emerging as transformative healthcare tools. Three leading LLMs (Deepseek, ChatGPT, and Grok) show great potential for medical education, clinical decision-making, and patient care. Bone cancer includes diverse primary and metastatic tumors, each with distinct diagnostic criteria, treatment pathways, and prognoses. Using the NCCN clinical guidelines for bone cancer as a reference, this study assesses the accuracy of Deepseek V3.1, ChatGPT 5, and Grok 4 in addressing bone cancer-related questions.

Methods: Based on the clinical guidelines for bone cancer released by the NCCN in April 2025, 52 questions related to bone cancer were developed. Researchers posed the questions to Deepseek V3.1, ChatGPT 5, and Grok 4 and collected the responses; each LLM was queried twice within a one-month period. The collected responses were independently evaluated and scored by two bone cancer-treating specialists in accordance with the scoring criteria.

Results: Among the answers to the 52 bone cancer-related questions, the probability of Deepseek V3.1, ChatGPT 5, and Grok 4 providing correct responses in both rounds was greater than 90%. Additionally, no correlation was observed among the LLMs’ scores, word counts, and response times. The total scores of Deepseek V3.1, ChatGPT 5, and Grok 4 were 3.75 ± 0.71, 3.81 ± 0.6, and 0.87 ± 0.51, respectively. The word counts of responses from Deepseek V3.1, ChatGPT 5, and Grok 4 were 546.56 ± 194.49, 367.02 ± 273.18, and 194.16 ± 197.07 words, respectively. The response times of Deepseek V3.1, ChatGPT 5, and Grok 4 were 11.83 ± 3.41 s, 1.52 ± 0.52 s, and 42.48 ± 26.89 s, respectively. No statistically significant differences in scores were found for any of the LLMs between the two rounds. However, ChatGPT 5 showed a statistically significant difference in word count between the two rounds (360.12 ± 279.89 vs. 373.94 ± 268.86 words).

Conclusion: When answering bone cancer-related questions, Deepseek V3.1, ChatGPT 5, and Grok 4 generally performed well. Specifically, when responding to questions about Ewing sarcoma, ChatGPT 5 and Grok 4 demonstrated higher accuracy than Deepseek V3.1. While each model has its own strengths and limitations, their collective potential to enhance medical knowledge and improve healthcare outcomes is undeniable.

Created:

2025-12-05