Supplementary file 1_Evaluating the quality of large language model-generated preoperative patient education material: a comparative study across models and surgery types.docx
Indexed by NIAID Data Ecosystem on 2026-05-10
Download link:
https://figshare.com/articles/dataset/Supplementary_file_1_Evaluating_the_quality_of_large_language_model-generated_preoperative_patient_education_material_a_comparative_study_across_models_and_surgery_types_docx/30856625
Description:
Background: Numerous studies have confirmed the effectiveness of large language models (LLMs) as a patient education tool; however, these studies relied primarily on posing medical questions to the models. To date, no study has comprehensively assessed the quality of complete preoperative patient education materials (PEMs) generated by LLMs across different models and surgical types.
Objective: This study aims to comprehensively assess and compare the quality of complete preoperative PEMs generated by six common LLMs for different surgical types.
Design: A cross-sectional comparative study.
Methods: We prompted 6 LLMs to generate preoperative PEMs for 6 distinct surgical types. For each surgical type, 3 groups of experts from relevant fields rated the materials for accuracy and completeness on a 5-point scale. Two researchers assessed the materials for understandability and actionability using the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P), and for suitability using the Suitability Assessment of Materials (SAM). We also analyzed the materials for readability with Flesch-Kincaid and for sentiment with the VADER sentiment analysis tool. Statistical analysis was performed using the Friedman test, followed by Conover's post-hoc test with Bonferroni correction.
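As an illustration of the readability step, the following is a minimal sketch of the standard Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. It is not the authors' implementation: the abstract does not say which Flesch-Kincaid variant or software was used, and the function names `count_syllables` and `flesch_kincaid_grade`, the regex-based tokenization, and the vowel-group syllable heuristic are all assumptions made for this example.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as the number of vowel groups (hypothetical heuristic).

    Dedicated readability tools use dictionaries or finer rules; this crude
    count overestimates words with a silent final 'e', for example.
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Return the Flesch-Kincaid Grade Level of an English text."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )
```

A higher score indicates text that requires more years of schooling to read, so a material made of short sentences and short words should score lower than one made of long, polysyllabic sentences. Scores from this sketch will differ somewhat from those of established readability software because of the simplified syllable count.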
Results: Each model had strengths in different dimensions. All models demonstrated excellent accuracy, understandability, and actionability, with no statistically significant differences among them. In terms of completeness, Grok-4 and Claude-Opus-4 significantly outperformed GPT-4o. For suitability, Claude-Opus-4 performed best and Grok-4 worst. For readability, Grok-4 and Gemini-2.5-Pro were the easiest to read, while Claude-Opus-4 had the lowest readability. Moreover, only Gemini-2.5-Pro consistently generated content with positive sentiment.
Conclusion: The materials generated by these models can achieve high quality across multiple dimensions, but no model is perfect. Medical staff can use these models to generate initial drafts of preoperative PEMs; however, the drafts still need to be reviewed and supplemented by medical staff before they are given to patients.
Created:
2025-12-11



