Development and validation of a GPT-based rater for assessing communication skills using the Gap-Kalamazoo Communication Skills Assessment Form

Name: Development and validation of a GPT-based rater for assessing communication skills using the Gap-Kalamazoo Communication Skills Assessment Form
Creator: Hsieh, Ching-Lin; Lee, Shih-Chieh; Ju, Yu-Jeng; Liu, Cheng‐Heng; Lee, Meng-Lin; Hou, Chieh-Yi; Yang, Chih-Wei; Wang, Yi-Ching
Published: 2025-08-23 00:00:00
License: No description available

Taylor & Francis Group. Updated 2025-08-23; indexed 2026-04-16.

Download link:

https://tandf.figshare.com/articles/dataset/Development_and_validation_of_a_GPT-based_rater_for_assessing_communication_skills_using_the_Gap-Kalamazoo_Communication_Skills_Assessment_Form/29649817/1

Description:

This study developed a generative pre-trained transformer (GPT)-based rater to assess communication skills using the Gap-Kalamazoo Communication Skills Assessment Form (GKCSAF) and examined its inter-rater reliability and concurrent validity. The GPT rater assessed 80 therapist-patient interaction transcripts previously assessed by human raters. For inter-rater reliability, at the total-score level, the GPT rater’s assessments showed acceptable differences from human ratings (mean absolute error % [MAE%] = 12.2%–21.0%). However, intraclass correlation coefficients (ICC) with human ratings were low (0.00–0.35), which might be due to limited score variability. At the domain level, only four domains showed acceptable differences (MAE% ≤ 30.3%), but all nine domains showed poor agreement (weighted κ ≤ 0.38). For concurrent validity, the GPT rater’s assessments also showed acceptable differences but low ICC values relative to average human scores at both the total-score level (MAE% = 10.8%–11.5%; ICC = 0.12–0.36) and the domain level (MAE% = 14.0%–30.3%; ICC = 0.00–0.37). Overall, the GPT rater may serve as a supplementary tool for providing total scores in low-stakes assessments of communication skills. Its performance at the domain level appears limited, highlighting the need for caution in domain interpretation and the importance of further refinement for high-stakes or detailed assessment contexts.
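
The abstract's remark that low ICC values might stem from limited score variability can be illustrated with a small sketch. Everything here is an assumption made for illustration and is not taken from the study: the scores are invented, MAE% is taken to be the mean absolute error divided by a hypothetical possible score range of 36 points, and the ICC variant is ICC(2,1) (two-way random effects, absolute agreement, single rater), which may differ from the form the authors used.

```python
from statistics import mean


def mae_percent(a, b, score_range):
    """MAE as a percentage of the possible score range (assumed definition)."""
    mae = mean(abs(x - y) for x, y in zip(a, b))
    return 100.0 * mae / score_range


def icc_2_1(a, b):
    """ICC(2,1) for two raters: two-way random effects, absolute agreement."""
    n, k = len(a), 2
    rows = list(zip(a, b))
    grand = mean(a + b)
    row_means = [mean(r) for r in rows]
    col_means = [mean(a), mean(b)]
    # Sums of squares for subjects (rows), raters (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


SCORE_RANGE = 36  # hypothetical: total scores assumed to span 9 to 45

# Invented scores. Both pairs have the same absolute error (3 points per
# transcript); only the spread of the human scores differs.
human_wide = [12, 18, 24, 30, 36, 42]
gpt_wide = [15, 15, 27, 27, 39, 39]
human_narrow = [34, 35, 36, 34, 35, 36]
gpt_narrow = [37, 32, 39, 31, 38, 33]

for label, h, g in [("wide", human_wide, gpt_wide),
                    ("narrow", human_narrow, gpt_narrow)]:
    print(label, round(mae_percent(h, g, SCORE_RANGE), 1), round(icc_2_1(h, g), 2))
```

With these invented numbers, MAE% is identical for both pairs, yet the ICC is high when human scores are widely spread and low when they cluster within a few points. This is the pattern the abstract offers as a possible explanation; the sketch does not show that it holds for the study's data.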

Provided by:

Hsieh, Ching-Lin; Lee, Shih-Chieh; Ju, Yu-Jeng; Liu, Cheng‐Heng; Lee, Meng-Lin; Hou, Chieh-Yi; Yang, Chih-Wei; Wang, Yi-Ching

Created:

2025-07-26