Data Sheet 1_Evaluation of the accuracy and repeatability of Deepseek V3, Doubao, and Kimi1.5 in answering knowledge-related queries about chronic non-bacterial osteitis.zip

Indexed by NIAID Data Ecosystem on 2026-05-10

Download link:

https://figshare.com/articles/dataset/Data_Sheet_1_Evaluation_of_the_accuracy_and_repeatability_of_Deepseek_V3_Doubao_and_Kimi1_5_in_answering_knowledge-related_queries_about_chronic_non-bacterial_osteitis_zip/30230416

Description:

Background: The diagnosis and treatment of chronic non-bacterial osteitis (CNO) vary significantly, and there is an urgent need for health education efforts to enhance awareness of this condition. Deepseek V3, Doubao, and Kimi1.5 are highly popular language models in China that can provide disease-related knowledge. This article investigates the accuracy and reproducibility of the responses these three artificial intelligence (AI) language models give to questions about CNO.

Methods: Based on the latest expert consensus, 16 questions related to CNO were collected. Each of the three AI language models was asked these questions at three different times. The answers were independently evaluated by two orthopedic experts.

Results: Among the responses of the three AI models to the 16 CNO-related questions across three rounds of testing, only Doubao received "Completely incorrect" ratings (accounting for 6.25%), in the third round of scoring by Reviewer 2. Doubao had the shortest response time and provided the most words in its answers. In the scoring by Reviewer 1, Kimi1.5 scored highest in the first and third rounds (3.938 ± 0.342, 3.875 ± 0.873), while Doubao scored highest in the second round (3.875 ± 0.5). In the scoring by Reviewer 2, Doubao scored highest in the second round (3.812 ± 0.403), while Kimi1.5 scored highest in the first and third rounds (3.812 ± 0.602, 3.812 ± 0.704).

Conclusion: Deepseek V3, Doubao, and Kimi1.5 are capable of answering most questions related to CNO with good accuracy and reproducibility, with no significant differences among the three models.

Created:

2025-09-29