Usability and reliability of artificial intelligence platforms in preparation for qualification exams: an experimental study on osteoarthritis

Indexed by NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/15031729

Description:

Objective: This study investigates how well artificial intelligence (AI) platforms perform, in terms of scientific accuracy, reliability, and completeness, when used to prepare for the Physical Medicine and Rehabilitation specialty board exams. AI is now widely used in many sectors, its areas of application are expanding rapidly, and its adoption in medicine and the health sciences has become inevitable.
Methods: Forty-six questions from the osteoarthritis chapter of the book “Robert Kaplan, Pearls of Wisdom, Second Edition” were posed to the ChatGPT-4.0 and Gemini Advanced 2.0 Flash platforms. The responses were rated against the reference answers in the source on a 5-point Likert scale for completeness, clarity, lack of false information (accuracy), and presence of evidence. The Wilcoxon signed-rank test and the paired t-test were used to compare the platforms' scores, and the Friedman test was used to identify within-group differences.
Results: ChatGPT scored high in completeness (4.37±0.77), clarity (4.57±0.62), accuracy (4.78±0.63), and evidence presence (4.59±0.69). Gemini scored lower in completeness (3.72±0.96) and evidence presence (3.20±1.69). According to the Wilcoxon test, ChatGPT was statistically significantly superior in completeness and evidence presence (p<0.001; effect sizes 0.52 and 0.61, respectively). The overall evaluation score was 4.58±0.49 for ChatGPT and 4.01±0.66 for Gemini, a statistically significant difference (p<0.001, Cohen’s d=0.74). Both platforms were most successful on treatment-related questions about osteoarthritis and least successful in the general information category. According to the Friedman test, both models performed best on lack of false information; ChatGPT’s weakest category was completeness, and Gemini’s was evidence support.
Conclusion: ChatGPT provided more consistent and comprehensive information in osteoarthritis-related exam preparation materials, whereas Gemini, despite similar accuracy, was weaker in completeness and use of references. AI-supported study methods could serve as a supportive tool in board exam preparation, but their output should be carefully evaluated for completeness and evidence support. Because accurate medical knowledge is critical, responses generated by AI must be verified against reliable sources.
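
The abstract reports a Cohen's d for the paired comparison of the two platforms but does not state which formula was used. As a rough illustration only, the sketch below shows two common conventions: a paired version (mean difference divided by the standard deviation of the differences) and a pooled-SD version computed from summary statistics. The function names and the example ratings are hypothetical and are not taken from the study's data.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d_paired(a, b):
    """Paired-samples effect size: mean difference / SD of the differences."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; effect size undefined")
    return mean(diffs) / sd


def cohens_d_pooled(mean_a, sd_a, mean_b, sd_b):
    """Effect size from summary statistics, pooling the SDs of two equal-sized groups."""
    return (mean_a - mean_b) / sqrt((sd_a ** 2 + sd_b ** 2) / 2)


# Hypothetical 1-5 Likert ratings of the same six questions (NOT study data).
platform_a = [5, 4, 5, 4, 5, 3]
platform_b = [4, 4, 3, 4, 4, 3]

print(round(cohens_d_paired(platform_a, platform_b), 2))  # 0.82
print(round(cohens_d_pooled(4.5, 0.5, 4.0, 0.5), 2))      # 1.0
```

The two conventions can give noticeably different values for the same data, so an effect size is easiest to interpret when the formula behind it is reported alongside it.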

Created:

2025-03-15