EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems

Name: EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems
Creator: physionet.org
License: No description available

Indexed from physionet.org on 2025-03-22

Download link:

https://physionet.org/content/ehr-ds-qa/1.0.0/

Description:

This dataset was created to support the development of healthcare-focused large language models, particularly for retrieval-augmented clinical question answering. It was built with a custom pipeline based on the 13-billion-parameter Meta Llama 2 model and comprises 21,466 medical discharge summaries drawn from the MIMIC-IV-Note dataset, together with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. The pairs were generated by giving the model a discharge summary and instructing it to write question-and-answer pairs grounded in the information in that summary. The aim is to provide data for developing compact large language models that can efficiently extract information from medical notes and discharge summaries, which could in turn improve real-time decision-making in clinical settings. The dataset is accompanied by code for generating question-and-answer pairs from any text, medical or non-medical.

The dataset has limitations. Because of hardware constraints, generation was limited to a maximum context length of 6,000 input tokens. Because the pairs were produced by a large language model, they may carry underlying bias or lack diversity and complexity. Future iterations should address these issues, possibly through diversified training, expanded verification procedures, and the use of more powerful large language models.
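
The generation step described above (pass a discharge summary to the model, ask for question-and-answer pairs, stay within the input limit) can be sketched roughly as follows. This is a minimal illustration, not the code released with the dataset: the prompt wording, the `Q:`/`A:` output format, the whitespace-based token count, and every function name here are assumptions, and `llm` stands for any callable that maps a prompt string to generated text.

```python
import re
from typing import Callable, List, Tuple

# Input limit stated in the dataset description.
MAX_INPUT_TOKENS = 6000


def truncate_to_budget(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Keep at most max_tokens whitespace-separated tokens.

    Assumption: whitespace splitting is a crude stand-in for the model's
    real tokenizer, and it collapses the original line breaks.
    """
    return " ".join(text.split()[:max_tokens])


def build_prompt(summary: str) -> str:
    """Hypothetical instruction prompt; the original wording is not given."""
    return (
        "Read the discharge summary below and write question-and-answer "
        "pairs that can be answered from it alone.\n"
        "Format each pair as:\nQ: <question>\nA: <answer>\n\n"
        f"Discharge summary:\n{summary}\n"
    )


# Matches one "Q: ... / A: ..." pair; an answer ends at the next "Q:" or at
# the end of the text.
QA_PATTERN = re.compile(r"Q:\s*(.+?)\s*\nA:\s*(.+?)\s*(?=\nQ:|\Z)", re.DOTALL)


def parse_qa_pairs(output: str) -> List[Tuple[str, str]]:
    """Extract (question, answer) tuples from the model's raw output."""
    return [(q.strip(), a.strip()) for q, a in QA_PATTERN.findall(output)]


def generate_qa(summary: str, llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Truncate the summary, prompt the model, and parse its answer."""
    prompt = build_prompt(truncate_to_budget(summary))
    return parse_qa_pairs(llm(prompt))


if __name__ == "__main__":
    # Stand-in model that returns a fixed response, for demonstration only.
    def fake_llm(prompt: str) -> str:
        return (
            "Q: What was the diagnosis?\nA: Pneumonia.\n"
            "Q: What was prescribed?\nA: Antibiotics.\n"
        )

    for question, answer in generate_qa("Patient admitted with cough.", fake_llm):
        print(question, "->", answer)
```

In this sketch the token budget is applied to the summary only, so the instruction text adds to the total prompt length; a real pipeline would presumably count tokens with the model's own tokenizer and budget for the instructions as well.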

Provider:

physionet.org