Data Sheet 1_Improving reliability and accuracy of structured data extraction using a consensus large-language model approach–a use case description in multiple sclerosis.pdf

Source: NIAID Data Ecosystem (indexed 2026-05-10)

Download link:

https://figshare.com/articles/dataset/Data_Sheet_1_Improving_reliability_and_accuracy_of_structured_data_extraction_using_a_consensus_large-language_model_approach_a_use_case_description_in_multiple_sclerosis_pdf/31330921


Description:

Background: The absence of standardization in the documentation of routine clinical data complicates large-scale research use of retrospective data. Medically trained personnel are required to interpret the data and convert it into a structured format, which is time- and cost-intensive and can introduce bias. To address these challenges, we have developed a semi-automated approach for evaluating multiple sclerosis (MS) outpatient reports that uses different large language models (LLMs) and their consensus, and we compare it with manual evaluation.

Methods: We used several commercially available LLMs by OpenAI, Anthropic and Google to extract several variables of differing complexity from 30 anonymized outpatient reports into a structured output using zero-shot learning. We added a consensus output by combining the results of three different LLMs. Over several runs, we adapted the prompt, compared the results with a reference and assessed the error rate. Any deviation from the reference was considered an error. A true-error rate, in which only content deviations count as errors, was determined for the LLM consensus output and the neurology specialist output.

Results: Over 9 iterations of improving the structure and content of the prompt, we saw a clear reduction in the error rate of the various LLMs. By creating an LLM consensus with the final prompt design, we were able to overcome a ceiling effect in reducing the error rate. With a true-error rate of 1.48%, the LLM consensus has an error rate comparable to that of neurologists (around 2%) in the creation of structured data.

Discussion: Our method enables fast and reliable LLM-based analysis of large clinical routine data sets of varying complexity with a low technical barrier to entry. By generating an LLM consensus, we were able to considerably improve the quality of the output, making it comparable to data created by neurology specialists. This approach allows large amounts of unstructured data to be analyzed in a time- and cost-efficient manner. Nevertheless, evaluating errors in LLM-produced results remains difficult. Scientific work using such methods must continue to be subject to strict validation of the method.
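The consensus step and the error rate described in the Methods can be illustrated with a minimal Python sketch. The description does not say how the three LLM outputs are combined, so the strict majority vote used here is an assumption, as are the function names, the field names (`ms_type`, `edss`, `therapy`) and the toy values, which are not data from the study.

```python
from collections import Counter
from typing import Optional


def consensus(values: list[Optional[str]]) -> Optional[str]:
    """Return the value given by a strict majority of models, else None.

    Assumption: majority voting; the source only says results were "combined".
    """
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def consensus_record(outputs: list[dict[str, str]]) -> dict[str, Optional[str]]:
    """Build one consensus record from several per-model structured outputs."""
    fields = outputs[0].keys()
    return {f: consensus([o.get(f) for o in outputs]) for f in fields}


def error_rate(predicted: dict[str, Optional[str]], reference: dict[str, str]) -> float:
    """Fraction of reference fields where the prediction deviates in any way."""
    errors = sum(predicted.get(f) != v for f, v in reference.items())
    return errors / len(reference)


# Toy example with made-up values for one hypothetical report.
model_outputs = [
    {"ms_type": "RRMS", "edss": "2.0", "therapy": "ocrelizumab"},
    {"ms_type": "RRMS", "edss": "2.5", "therapy": "ocrelizumab"},
    {"ms_type": "SPMS", "edss": "2.0", "therapy": "ocrelizumab"},
]
reference = {"ms_type": "RRMS", "edss": "2.0", "therapy": "ocrelizumab"}

merged = consensus_record(model_outputs)
print(merged)
print(error_rate(merged, reference))
```

In this toy case each single model disagrees with the others on at most one field, while the majority vote recovers the reference on all three. A field with no majority is returned as `None` and would count as an error, which is one possible design choice; flagging it for manual review would be another.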


Created: 2026-02-13
