Reproducibility between rounds.
Source: Figshare. Updated: 2025-12-16. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Reproducibility_between_rounds_/30897552
Description:
Background: The rapid rise of AI in medical and pharmaceutical education has engendered much interest; however, a knowledge gap still exists in the evaluation of the performance of these tools in critical academic contexts.
Objectives: The aim of this study was to assess and compare the performance of four openly accessible AI language tools, Microsoft Copilot, ChatGPT-3.5, Google Gemini, and DeepSeek AI, in responding to pharmacology-related MCQs with regard to diagnostic accuracy, sensitivity, specificity, and reproducibility.
Methods: A total of 80 MCQs were generated and validated, representing four therapeutic systems (cardiovascular, respiratory, gastrointestinal, and endocrine) and four pharmacological domains (mechanism of action, side effects, pharmacokinetics, and drug-drug interactions). Answers were classified into true/false positives and negatives in order to calculate accuracy, sensitivity, and specificity. After two weeks, a second round of testing was performed with the same questions to assess answer reproducibility.
Results: The top overall performer was Microsoft Copilot, with 87.5% accuracy, a sensitivity of 94.6%, and a specificity of 70.8%. It performed strongly across all therapeutic systems, especially in the cardiovascular and respiratory systems, with the highest accuracy in identifying drug mechanisms and side effects. ChatGPT-3.5 performed similarly to Google Gemini (76.3% and 75.0% accuracy, respectively), but with higher sensitivity for ChatGPT-3.5 and higher specificity for Gemini. DeepSeek AI had the lowest accuracy overall (68.8%) and the lowest specificity (29.2%), but the highest reproducibility (97.5%). The performance of all tools decreased significantly with increasing level of question difficulty (p …).
Conclusion: All tools have some value in pharmacology education, but Microsoft Copilot was the most consistently accurate. Limitations in complexity and reproducibility suggest that caution should be exercised in academic and clinical use, particularly given the variability seen with ChatGPT-3.5.
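The metrics named in the description can be sketched in a few lines of Python. This is a minimal illustration using the standard definitions of accuracy, sensitivity, and specificity, plus reproducibility taken as the share of questions answered identically in both rounds; the description does not state how the authors computed reproducibility, so that definition is an assumption. The names ConfusionCounts, accuracy, sensitivity, specificity, and reproducibility are hypothetical. The example counts (53, 7, 17, 3) are not reported in the description; they are one set of whole-number counts over 80 questions that happens to match the percentages reported for Microsoft Copilot.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of answers by class (hypothetical container for this sketch)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def accuracy(c: ConfusionCounts) -> float:
    """Share of all answers that were classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    """Share of actual positives that were identified as positive."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    """Share of actual negatives that were identified as negative."""
    return c.tn / (c.tn + c.fp)


def reproducibility(round1: Sequence[str], round2: Sequence[str]) -> float:
    """Share of questions with the same answer in both rounds (assumed definition)."""
    if len(round1) != len(round2):
        raise ValueError("both rounds must cover the same questions")
    if not round1:
        raise ValueError("at least one question is required")
    matches = sum(a == b for a, b in zip(round1, round2))
    return matches / len(round1)


# Illustrative counts only: inferred to be consistent with the reported
# Copilot percentages, not taken from the dataset itself.
example = ConfusionCounts(tp=53, fp=7, tn=17, fn=3)
print(f"accuracy    {accuracy(example):.1%}")
print(f"sensitivity {sensitivity(example):.1%}")
print(f"specificity {specificity(example):.1%}")
```

With two answer lists of 80 items that differ on 2 questions, reproducibility returns 0.975, which corresponds to the 97.5% figure format used in the description.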
Created:
2025-12-16



