Dataset for Comparative evaluation and performance of large language models in clinical infection control scenarios: a benchmark study
Source: Figshare. Updated: 2025-09-17. Indexed: 2026-04-08.
Download link:
https://figshare.com/articles/dataset/Dataset_for_Comparative_evaluation_and_performance_of_large_language_models_inclinical_infection_control_scenarios_a_benchmark_study/30149236/1
Description:
This cross-sectional benchmarking study evaluated three large language models (LLMs)—GPT-4.1, DeepSeek V3, and Gemini 2.5 Pro Experimental—for supporting infection control nurses (ICNs) in clinical infection prevention and control (IPC) consultations. Using 30 real hospital scenarios from Queen Mary Hospital (Hong Kong), each LLM first generated clarifying questions, then produced recommendations via two prompting methods: open-ended and a structured template. Sixteen experts (ICNs and physicians) rated outputs on coherence, conciseness, usefulness/relevance, evidence quality, and actionability (1–10). GPT-4.1 and DeepSeek V3 outperformed Gemini on composite quality (36.77 ± 7.53; 36.25 ± 8.02 vs. 33.22 ± 7.92; p < 0.001). GPT-4.1 led in evidence quality. Task time was similar across models (≈2–3 minutes). Gemini failed to generate responses in 50% of scenarios, likely due to context-length limits. Structured prompting yielded small but significant improvements overall, driven mainly by better evidence quality, with variable gains across models. Despite acceptable scores, qualitative review identified critical safety issues in all models, including flawed clinical judgment (e.g., initiating TB treatment based solely on AFB smear; overly aggressive measles isolation) and impractical or policy-inconsistent advice (e.g., MDRA isolation). Doctors scored outputs higher than nurses; senior doctors scored highest. Conclusion: GPT-4.1 and DeepSeek V3 can assist but are not reliable for autonomous IPC decision-making. LLMs should augment, not replace, ICNs’ expertise.
Provided by:
Chiu, Kwan Yeung Edwin
Created:
2025-09-17



