"Benchmark Dataset for Thyroid Eye Disease Patient Counseling Across Five Large Language Models"

Name: "Benchmark Dataset for Thyroid Eye Disease Patient Counseling Across Five Large Language Models"
Creator: IEEE DataPort
Published: 2026-03-02 08:45:05
License: Not specified

DataCite Commons: updated 2026-03-02, indexed 2026-05-03

Download link:

https://ieee-dataport.org/competitions/benchmark-dataset-thyroid-eye-disease-ted-patient-counseling-across-five-large

Description:

"Thyroid eye disease (TED) requires timely risk stratification and triage, yet patients increasingly use large language model (LLM) chatbots for guidance. Comparative evidence on the safety and counseling quality of newer web-deployed LLMs in ophthalmic care remains limited. We performed a cross-sectional benchmarking study using a prespecified 35-item Chinese TED patient-counseling question bank and a standardized single-turn protocol to evaluate five publicly accessible LLM chatbot services (Gemini 3 Pro, ChatGPT-5.2, DeepSeek-V3.1, Doubao, and Qwen3-Max). All systems were accessed through official web interfaces in Quzhou, China, during 27–29 December 2025, and user-visible model identifiers/settings were documented. Objective response features (response time, words, characters, paragraphs, sentences, and tables) were quantified, and two blinded experts rated outputs against a guideline-/consensus-informed reference standard using 5-point Likert scales for Accuracy, Logic, Coherence, Safety, and Content Accessibility. Between-model comparisons and correlation analyses were conducted. Response time differed significantly (P<0.001): Gemini 3 Pro was fastest (32.52±4.53 s) and Doubao slowest (63.33±11.69 s). Output structure also varied substantially, with Doubao generating the longest responses, ChatGPT-5.2 the shortest, and Qwen3-Max the most table-formatted outputs. Significant between-model differences were observed for accuracy, logic, coherence, and content accessibility (all P≤0.007), but not safety (P=0.828). Longer or slower responses did not consistently indicate higher clinical quality. These findings highlight substantial heterogeneity across contemporary LLMs for TED counseling and support risk-centered, structured response design and further validation in multi-turn, safety-focused ophthalmic triage workflows."
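
The description lists objective response features (characters, paragraphs, sentences, tables) without saying how they were counted. The following is a minimal sketch of one way such counts might be derived from a saved response, not the authors' procedure. The function name `response_features`, the punctuation set used to split sentences, and the use of Markdown separator rows to detect tables are all assumptions made for illustration. Word counts are omitted because Chinese text would need a word segmenter, and response time has to be recorded at collection time.

```python
import re

# Assumption: a table is detected by its Markdown separator row, e.g. "| --- | --- |".
TABLE_RULE = re.compile(
    r"^[ \t]*\|?[ \t]*:?-{3,}:?[ \t]*(?:\|[ \t]*:?-{3,}:?[ \t]*)+\|?[ \t]*$",
    re.MULTILINE,
)

# Assumption: sentences end with Chinese or Western terminal punctuation.
# The ASCII full stop is left out so decimals such as "32.52" are not split.
SENTENCE_END = re.compile(r"[。！？!?]+")


def response_features(text: str) -> dict:
    """Count simple structural features of one chatbot response (hypothetical helper)."""
    # Paragraphs: non-empty blocks separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    # Sentences: non-empty fragments between terminal punctuation marks.
    sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
    # Characters: every character that is not whitespace.
    characters = len(re.sub(r"\s", "", text))
    # Tables: one per Markdown separator row found.
    tables = len(TABLE_RULE.findall(text))
    return {
        "characters": characters,
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "tables": tables,
    }
```

Applying the function to each of the 35 questions for each of the five models would give one feature row per response. Note that this simple splitter also counts table text as sentence fragments, so a real analysis might strip tables first.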

Provider:

IEEE DataPort

Created:

2026-03-02
