Correct response rate according to figures.

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://figshare.com/articles/dataset/Correct_response_rate_according_to_figures_/29173296

Description:

Introduction: This study evaluates the ability of large language models (LLMs) to generate and answer questions in oral and maxillofacial surgery.

Methods: ChatGPT4, ChatGPT4o, and Claude3-Opus were evaluated. Each LLM was instructed to generate 50 questions about oral and maxillofacial surgery, and all three LLMs were then asked to answer the resulting 150 questions.

Results: All 150 questions generated by the three LLMs were related to oral and maxillofacial surgery. Each model achieved a correct answer rate of over 90%, but none of the three correctly answered all of the questions it had generated itself. The correct answer rate was 97.0% for questions with figures, significantly higher than the 88.9% rate for questions without figures. Analysis of the three LLMs' problem-solving showed that each model generally inferred answers with high accuracy, with few logical errors that could be considered controversial. All three also scored above 88% for the fidelity of their explanations.

Conclusion: ChatGPT4, ChatGPT4o, and Claude3-Opus exhibit robust capabilities in generating and solving oral and maxillofacial surgery questions, but their performance is not without limitations. None of the models correctly answered all of the questions it generated itself, highlighting persistent challenges such as AI hallucinations and gaps in contextual understanding. The results also emphasize the importance of multimodal inputs: questions with annotated images achieved higher accuracy than text-only prompts. Despite these shortcomings, the LLMs showed significant promise in problem-solving, logical consistency, and response fidelity, particularly in structured or numerical contexts.
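The description compares correct answer rates for questions with and without figures but does not name the statistical test used. As a minimal sketch, assuming a pooled two-proportion z-test is an acceptable stand-in, a comparison of this kind could be computed from raw counts as follows. The function names and the counts are hypothetical placeholders and are not taken from the dataset.

```python
from math import erfc, sqrt


def correct_rate(correct: int, total: int) -> float:
    """Return the share of correctly answered questions as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def two_proportion_z_test(c1: int, n1: int, c2: int, n2: int):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumption: the source does not state which test the authors used,
    so this is only one common choice for comparing two rates.
    """
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Placeholder counts for illustration only; NOT the study's data.
    with_figures = (45, 50)       # (correct, total)
    without_figures = (80, 100)   # (correct, total)

    z, p = two_proportion_z_test(*with_figures, *without_figures)
    print(f"with figures:    {correct_rate(*with_figures):.1f}%")
    print(f"without figures: {correct_rate(*without_figures):.1f}%")
    print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

To apply this to the actual dataset, the placeholder tuples would be replaced by the per-group counts from the downloaded file. For small groups or rates close to 100%, an exact test may be a better fit than the normal approximation used here.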


Created:

2025-05-28
