
EHR-DS-QA: A Synthetic QA Dataset Derived from Medical Discharge Summaries for Enhanced Medical Information Retrieval Systems

Source: DataCite Commons. Updated: 2024-09-27. Indexed: 2024-07-13.
Download link:
https://physionet.org/content/ehr-ds-qa/1.0.0/
Resource description:
This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion-parameter Meta Llama 2 model, this dataset encompasses 21,466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156,599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summary. This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements for real-time decision-making processes in clinical settings. The dataset is accompanied by code that facilitates question-and-answer pair generation from any medical or non-medical text. Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6000 input tokens, owing to hardware constraints. Because the question-and-answer pairs were generated by a large language model, they may carry an underlying bias or lack diversity and complexity. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures as well as the employment of more powerful large language models.
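The generation step described above (a discharge summary goes in, question-and-answer pairs come out, with at most 6000 input tokens) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the prompt wording, the `Q:`/`A:` output convention, the function names, and the use of whitespace-separated words as a stand-in for the model tokenizer are all assumptions.

```python
MAX_INPUT_TOKENS = 6000  # input limit stated in the dataset description


def truncate_to_budget(text, max_tokens=MAX_INPUT_TOKENS):
    """Keep at most max_tokens tokens of the input text.

    Assumption: whitespace-separated words approximate tokens here;
    a real pipeline would count tokens with the model's own tokenizer.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


def build_prompt(summary, n_pairs=5):
    """Build a hypothetical instruction prompt around one discharge summary."""
    context = truncate_to_budget(summary)
    return (
        f"Read the discharge summary below and write {n_pairs} "
        "question-and-answer pairs based only on its content.\n"
        "Format each pair as 'Q: ...' followed by 'A: ...'.\n\n"
        f"Discharge summary:\n{context}\n"
    )


def parse_qa_pairs(model_output):
    """Collect (question, answer) tuples from 'Q:' / 'A:' lines."""
    pairs = []
    question = None
    for line in model_output.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None  # an answer closes the current pair
    return pairs
```

The prompt returned by `build_prompt` would be sent to whichever model is in use, and the model's text response passed to `parse_qa_pairs`. Truncating the summary before building the prompt keeps the input within the stated limit, at the cost of dropping the end of very long summaries.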
Provider:
PhysioNet
Created:
2023-12-20
Collected summary
Dataset introduction
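Background and challenges
Background overview
EHR-DS-QA is a synthetic question-answering dataset derived from medical discharge summaries, intended to enhance medical information retrieval systems. It contains 21,466 discharge summaries from the MIMIC-IV-Note dataset and 156,599 generated question-and-answer pairs, a portion of which were verified by a physician, with accuracy above 94%. The dataset is available in JSON and CSV formats and is suitable for training and evaluating question-answering models in the medical domain.
The above content was collected and summarized by 遇见数据集.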
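Since the dataset is described as available in JSON and CSV formats, loading the question-and-answer pairs might look roughly like the sketch below. The field names `question` and `answer`, the flat list-of-objects JSON layout, and the sample row are assumptions for illustration; the actual schema should be checked against the files on PhysioNet.

```python
import csv
import io
import json


def load_qa_csv(handle):
    """Read (question, answer) tuples from a CSV file object.

    Assumption: the file has 'question' and 'answer' columns.
    """
    return [(row["question"], row["answer"]) for row in csv.DictReader(handle)]


def load_qa_json(handle):
    """Read (question, answer) tuples from a JSON file object.

    Assumption: the file is a list of objects with 'question' and 'answer' keys.
    """
    return [(item["question"], item["answer"]) for item in json.load(handle)]


# Made-up sample data, used only to show the call pattern.
sample_csv = io.StringIO(
    "question,answer\nWhat was the discharge diagnosis?,Pneumonia\n"
)
sample_json = io.StringIO(
    json.dumps([{"question": "What was the discharge diagnosis?",
                 "answer": "Pneumonia"}])
)

csv_pairs = load_qa_csv(sample_csv)
json_pairs = load_qa_json(sample_json)
```

Both loaders accept any open text file object, so the same functions work on in-memory samples and on files opened with `open(path, newline="")`.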