EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems
Source: DataCite Commons | Updated: 2024-09-27 | Indexed: 2024-07-13
Download link:
https://physionet.org/content/ehr-ds-qa/1.0.0/
Resource description:
This dataset was designed and created to enable advancements in healthcare-
focused large language models, particularly in the context of retrieval-
augmented clinical question-answering capabilities. Developed using a self-
constructed pipeline based on the 13-billion parameter Meta Llama 2 model,
this dataset encompasses 21466 medical discharge summaries extracted from the
MIMIC-IV-Note dataset, with 156599 synthetically generated question-and-answer
pairs, a subset of which was verified for accuracy by a physician. These pairs
were generated by providing the model with a discharge summary and instructing
it to generate question-and-answer pairs based on the contextual information
present in the summaries. This work aims to generate data in support of the
development of compact large language models capable of efficiently extracting
information from medical notes and discharge summaries, thus enabling
potential improvements for real-time decision-making processes in clinical
settings. Additionally, accompanying the dataset is code facilitating
question-and-answer pair generation from any medical and non-medical text.
Despite the robustness of the presented dataset, it has certain limitations.
The generation process was confined to a maximum context length of 6000 input
tokens, owing to hardware constraints. Because the question-and-answer pairs
were produced by a large language model, they may carry an underlying bias or
lack diversity and complexity. Future iterations should focus on rectifying
these issues, possibly through diversified training, expanded verification
procedures, and the use of more powerful large language models.
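The generation step described above (pass a discharge summary to the model, ask for question-and-answer pairs, stay within the 6000-token input limit) can be sketched roughly as follows. This is an illustrative outline only, not the dataset authors' pipeline: the prompt wording, the "Q:/A:" output format, the function names, and the use of whitespace splitting in place of the model tokenizer are all assumptions.

```python
import re

MAX_INPUT_TOKENS = 6000  # context limit stated in the description


def truncate_to_budget(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Keep at most max_tokens whitespace-separated tokens.

    Assumption: whitespace splitting stands in for the model tokenizer,
    which would count tokens differently.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


def build_prompt(summary: str, n_pairs: int = 5) -> str:
    """Hypothetical instruction prompt; the real prompt is not given here."""
    return (
        f"Read the discharge summary below and write {n_pairs} "
        "question-and-answer pairs that can be answered from it alone.\n"
        "Format each pair as 'Q: ...' followed by 'A: ...'.\n\n"
        f"Summary:\n{truncate_to_budget(summary)}\n"
    )


def parse_pairs(model_output: str) -> list[tuple[str, str]]:
    """Extract (question, answer) tuples from 'Q: ... A: ...' formatted text."""
    pattern = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\s*Q:|\Z)", re.DOTALL)
    return [(q.strip(), a.strip()) for q, a in pattern.findall(model_output)]
```

In such a setup, the string returned by build_prompt would be sent to the language model, and parse_pairs would be applied to the text it returns.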
Provider:
PhysioNet
Created:
2023-12-20
Collected summary
Dataset introduction

Background and challenges
Background overview
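EHR-DS-QA is a synthetic question-answering dataset derived from medical discharge summaries, intended to enhance medical information retrieval systems. It contains 21466 discharge summaries from the MIMIC-IV-Note dataset and 156599 generated question-and-answer pairs, a subset of which was verified by a physician with an accuracy above 94%. The dataset is provided in JSON and CSV formats and is suited to training and evaluating question-answering models in the medical domain.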
The summary above was collected and generated by 遇见数据集.
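Since the summary mentions JSON and CSV formats, a reader might load either into a common list-of-dicts form with something like the sketch below. The field names (note_id, question, answer) and the sample rows are hypothetical placeholders, not the dataset's documented schema; check the files on PhysioNet for the actual layout.

```python
import csv
import io
import json


def load_records(text: str, fmt: str) -> list[dict]:
    """Parse QA records from CSV or JSON text into a list of dicts."""
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    if fmt == "json":
        data = json.loads(text)
        # Accept either a top-level list or a single object.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")


# Hypothetical sample rows; the real column names may differ.
csv_text = "note_id,question,answer\nn1,What was the diagnosis?,Pneumonia\n"
json_text = (
    '[{"note_id": "n1", "question": "What was the diagnosis?", '
    '"answer": "Pneumonia"}]'
)

csv_rows = load_records(csv_text, "csv")
json_rows = load_records(json_text, "json")
```

Normalizing both formats to the same structure keeps downstream training or evaluation code independent of which file was downloaded.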



