EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems

Name: EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems
Creator: PhysioNet
Published: 2024-01-11 20:52:34
License: No description available

DataCite Commons | Updated 2024-01-11 | Indexed 2024-07-13

Download link:

https://physionet.org/content/ehr-ds-qa/

Description:

This dataset was designed and created to advance healthcare-focused large language models, particularly retrieval-augmented clinical question answering. Built with a self-constructed pipeline based on the 13-billion-parameter Meta Llama 2 model, it comprises 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset and 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. The pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information in that summary. The work aims to produce data that supports the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, potentially improving real-time decision-making in clinical settings. The dataset is accompanied by code for generating question-and-answer pairs from any medical or non-medical text. Despite its robustness, the dataset has certain limitations. Owing to hardware constraints, generation was confined to a maximum context length of 6,000 input tokens. Because the pairs were produced by a large language model, they may carry underlying bias or lack diversity and complexity. Future iterations should address these issues, possibly through diversified training, expanded verification procedures, and the use of more powerful large language models.
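The generation step described above (one discharge summary in, several question-and-answer pairs out, under an input-token cap) can be sketched roughly as follows. This is only an illustration under stated assumptions: the prompt wording, the `Q:`/`A:` output format, the function names, and the whitespace-based token count are all hypothetical and are not taken from the dataset's accompanying code; the real pipeline would count tokens with the Llama 2 tokenizer.

```python
import re

MAX_INPUT_TOKENS = 6000  # context limit stated in the dataset description


def truncate_to_token_budget(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Keep at most max_tokens whitespace-separated tokens.

    Assumption: whitespace splitting is a crude stand-in for the model's
    real tokenizer, which would produce different counts.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


def build_prompt(summary: str, n_pairs: int = 5) -> str:
    """Build a hypothetical instruction prompt around one discharge summary."""
    context = truncate_to_token_budget(summary)
    return (
        f"Read the discharge summary below and write {n_pairs} "
        "question-and-answer pairs that can be answered from it alone.\n"
        "Format each pair as 'Q: ...' followed by 'A: ...'.\n\n"
        f"Discharge summary:\n{context}\n"
    )


def parse_qa_pairs(model_output: str) -> list[tuple[str, str]]:
    """Extract (question, answer) tuples from 'Q: ... A: ...' formatted text."""
    pattern = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\s*Q:|\Z)", re.DOTALL)
    return [(q.strip(), a.strip()) for q, a in pattern.findall(model_output)]
```

In use, `build_prompt(summary)` would be sent to whichever model is available, and `parse_qa_pairs` applied to the returned text; any pairs that fail to parse would simply be dropped from the result.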

Provider:

PhysioNet

Created:

2023-12-20

Summary

Background and Challenges

Background Overview

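EHR-DS-QA is a synthetic medical question-answering dataset. Based on 21,466 discharge summaries from MIMIC-IV-Note, it contains 156,599 question-and-answer pairs generated with the Meta Llama 2 model, some of which were verified by a physician. It is intended to support the development of retrieval-augmented clinical question-answering systems, with the goal of improving medical information retrieval efficiency and real-time decision-making. The dataset ships with code that can generate question-and-answer pairs from arbitrary text, but it has limitations, including a restricted context length and potential bias.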

The content above was collected and summarized by 遇见数据集.