Supplementary Material for: ChatGPT-4o in risk of bias assessments in neonatology – a validity analysis

Name: Supplementary Material for: ChatGPT-4o in risk of bias assessments in neonatology – a validity analysis
Creator: Karger Publishers
Published: 2025-02-25 13:29:30
License: No description available

Source: DataCite Commons. Updated: 2025-02-25. Indexed: 2025-05-07.

Download link:

https://karger.figshare.com/articles/dataset/Supplementary_Material_for_ChatGPT-4o_in_risk_of_bias_assessments_in_neonatology_a_validity_analysis/28485194/1

Description:

Background: Only a few studies have addressed the potential of large language models (LLMs) in risk of bias assessments, and the results have varied. The aim of this study was to analyze how well ChatGPT performs in risk of bias assessments of neonatal studies. Methods: We searched all Cochrane neonatal intervention reviews published in 2024 and extracted all risk of bias assessments. The full reports were then retrieved and uploaded to ChatGPT-4o alongside the guidance for performing an original Cochrane risk of bias analysis. The concordance between the original assessment and that provided by ChatGPT-4o was evaluated by intraclass correlation coefficients and Cohen’s kappa statistics (with 95% confidence intervals) for each risk of bias domain and for the overall assessment. Results: From nine reviews, a total of 61 randomized studies were analyzed, and 427 judgements were compared. The overall kappa was 0.43 (95% CI: 0.35-0.51) and the overall intraclass correlation coefficient was 0.65 (95% CI: 0.59-0.70). Cohen’s kappa was assessed for each domain; the best agreement was observed in allocation concealment (kappa=0.73, 95% CI: 0.55-0.90), whereas the poorest agreement was found in incomplete outcome data (kappa=-0.03, 95% CI: -0.07 to 0.02). Conclusion: ChatGPT-4o failed to achieve sufficient agreement in the risk of bias assessments. Future studies should examine whether other LLMs perform better or whether the agreement of ChatGPT-4o could be improved by better prompting. Currently, the use of ChatGPT-4o in risk of bias assessments should not be promoted.
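
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of unweighted Cohen's kappa for two sets of categorical judgements. It is an illustration only: the abstract does not state whether the authors used weighted or unweighted kappa or which software they used, and the function name `cohens_kappa` and the example ratings are hypothetical, not taken from the dataset.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of the marginal proportions.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical judgements for eight studies in one domain (not from the dataset).
reviewers = ["low", "low", "high", "unclear", "low", "high", "low", "unclear"]
model = ["low", "high", "high", "low", "low", "high", "low", "unclear"]

kappa = cohens_kappa(reviewers, model)
print(f"kappa = {kappa:.2f}")  # kappa = 0.60
```

In this toy example the raters agree on 6 of 8 items (0.75), chance agreement is 0.375, and kappa is (0.75 - 0.375) / (1 - 0.375) = 0.60. A kappa near zero or below, as reported for incomplete outcome data, means agreement is no better than expected by chance.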

Provider:

Karger Publishers

Created:

2025-02-25