AGBonnet/augmented-clinical-notes
Source: Hugging Face (updated 2024-01-24, indexed 2024-03-04)
Download link:
https://hf-mirror.com/datasets/AGBonnet/augmented-clinical-notes
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
pretty_name: Augmented Clinical Notes
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files:
  - split: train
    path: augmented_notes_30K.jsonl
tags:
- medical
- health
dataset_info:
  features:
  - name: idx
    dtype: string
  - name: note
    dtype: string
  - name: full_note
    dtype: string
  - name: conversation
    dtype: string
  - name: summary
    dtype: string
---
# Augmented Clinical Notes
The Augmented Clinical Notes dataset extends existing datasets and contains 30,000 triplets, each combining three sources:
- **Real clinical notes** (*[PMC-Patients](https://arxiv.org/abs/2202.13876)*): Clinical notes correspond to patient summaries from the PMC-Patients dataset, which are extracted from PubMed Central case studies.
- **Synthetic dialogues** (*[NoteChat](https://arxiv.org/abs/2310.15959)*): Synthetic patient-doctor conversations were generated from clinical notes using GPT-3.5.
- **Structured patient information** (*ours*): From clinical notes, we generate structured patient summaries using GPT-4 and a tailored medical information template (see details below).
This dataset was used to train [**MediNote-7B**](https://huggingface.co/AGBonnet/medinote-7b) and [**MediNote-13B**](https://huggingface.co/AGBonnet/medinote-13b), a set of clinical note generators fine-tuned from the [**MediTron**](https://huggingface.co/epfl-llm/meditron-7b) large language models.
Our full report is available [here](./report.pdf).
## Dataset Details
- **Curated by:** Antoine Bonnet and Paul Boulenger
- **Language(s):** English only
- **Repository:** [EPFL-IC-Make-Team/ClinicalNotes](https://github.com/EPFL-IC-Make-Team/ClinicalNotes)
- **Paper:** *[MediNote: Automated Clinical Notes](report.pdf)*
## Dataset Creation
**Clinical notes**. Our primary source of clinical notes is *[PMC-Patients](https://arxiv.org/abs/2202.13876)*, a large-scale dataset of 167K patient summaries extracted from open-access case studies published in PubMed Central. Each note is a detailed case presentation written by a doctor, covering the patient's visit, medical history, symptoms, administered treatments, discharge summary, and outcome of the intervention. These case presentations cover a diverse range of medical scenarios and form the basis for our model training and evaluation.
**Synthetic dialogues**. Distribution of confidential patient-doctor conversations is forbidden, so no large-scale dataset is publicly available for training. We work around the lack of real dialogue data by building upon [NoteChat](https://huggingface.co/datasets/akemiH/NoteChat), an extension of PMC-Patients with 167K synthetic patient-doctor conversations. Each dialogue transcript in the NoteChat dataset was generated from a clinical note by ChatGPT (version `gpt-3.5-turbo-0613`).
**Patient information**. We augment the PMC-Patients and NoteChat datasets by extracting structured patient information from the 30K longest clinical notes. To do so, we prompt GPT-4 (version `gpt-4-turbo-0613`) with zero-shot instructions, providing clinical notes and a structured template of patient medical information with feature definitions. This template, shown below, encapsulates crucial aspects of a clinical note such as the patient’s admission to a care center, medical history, current symptoms, as well as the doctor’s diagnosis and treatment plan.
The full data pipeline is shown below.
<p align="center">
<img width=70% src="data_pipeline.pdf" alt="Data pipeline" title="Data pipeline">
</p>
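As a rough illustration of the selection step described above, the sketch below keeps the N longest notes from a list of records. It assumes that length is measured in characters and that each record is a dict with a `note` key. The source does not say how length was measured, and `select_longest` is a hypothetical helper, not the authors' code.

```python
import heapq


def select_longest(records, n, key="note"):
    """Return the n records whose `key` text is longest, longest first.

    Character count is an assumption; the original pipeline may have used
    token counts or another length measure.
    """
    return heapq.nlargest(n, records, key=lambda r: len(r[key]))


# Synthetic records for illustration only.
records = [
    {"idx": "0", "note": "Short note."},
    {"idx": "1", "note": "A considerably longer clinical note with more detail."},
    {"idx": "2", "note": "A medium-length note."},
]
longest = select_longest(records, 2)
```

With the real data, `n` would be 30,000 and `key` could be switched to `full_note` if the untruncated text is the intended measure.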
### Medical information template
The medical template we used to structure clinical notes is shown below. A JSON version is also available as `template_definitions.json`.
<p align="center">
<img width=70% src="template.pdf" alt="Medical information template" title="Medical information template">
</p>
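To make the zero-shot setup more concrete, here is a minimal sketch of how a note and a template with feature definitions might be combined into a single prompt. The template layout (a flat mapping from field name to definition), the two example fields, and the instruction wording are all assumptions for illustration. The actual contents of `template_definitions.json` and the authors' prompt are not reproduced here.

```python
import json

# Hypothetical, heavily abbreviated template. The real field names and
# definitions live in `template_definitions.json`.
EXAMPLE_TEMPLATE = {
    "symptoms": "Symptoms currently reported by the patient",
    "diagnosis": "Diagnosis given by the doctor",
}


def build_prompt(note, template):
    """Assemble a zero-shot extraction prompt from a note and a template.

    The instruction wording is illustrative, not the prompt used by the authors.
    """
    return (
        "Fill in the following JSON template using only information "
        "from the clinical note. Use null for anything not mentioned.\n\n"
        f"Template with feature definitions:\n{json.dumps(template, indent=2)}\n\n"
        f"Clinical note:\n{note}\n"
    )


prompt = build_prompt("A 54-year-old man presented with chest pain.", EXAMPLE_TEMPLATE)
```

The resulting string would then be sent to the chosen model. That call is left out because it depends on the client library and credentials.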
### Dialogue Quality
The primary aim of synthetic dialogues is to distill comprehensive information from the case presentation, transforming it into a plausible and engaging conversation.
Newer versions of the dataset include higher-quality dialogues generated by GPT-4 and NoteChat, a multi-agent dialogue generation pipeline (see the [NoteChat repository](https://github.com/believewhat/Dr.NoteAid) for more information).
Dialogues produced by ChatGPT tend to lack realism and frequently follow a pattern in which the doctor asks a series of questions mirroring the facts in the original clinical note and the patient simply answers 'Yes'. Nevertheless, we used ChatGPT dialogues because they were the only ones available during the training phase.
Clinical notes within NoteChat were truncated before dialogue generation, so any information lost to truncation is also missing from the resulting dialogue. Although complete notes were available from PMC-Patients, we deliberately fine-tuned our models on the truncated notes to avoid training them to hallucinate information towards the end of a note.
Notably, some ChatGPT dialogues in which the patient dies and the doctor then speaks with a family member contain prompt leaks: the prompt used for synthetic dialogue generation is repeated within the dialogue.
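Two of the issues above can be screened for with simple heuristics, sketched here. The first flags a row whose `note` is a proper prefix of `full_note`, assuming truncation drops the tail of the note. The second measures how many patient turns are a bare "Yes", assuming one turn per line with a `Patient:` speaker label. Both assumptions about the data format are guesses and may need adjusting; the function names are hypothetical.

```python
import re


def is_truncated(note, full_note):
    """True if `note` is a proper prefix of `full_note` (outer whitespace ignored)."""
    note, full_note = note.strip(), full_note.strip()
    return len(note) < len(full_note) and full_note.startswith(note)


def bare_yes_ratio(conversation, patient_label="Patient:"):
    """Fraction of patient turns that consist only of 'Yes' (case-insensitive).

    Assumes one turn per line, each prefixed by a speaker label. The label
    is a guess about the dialogue format.
    """
    turns = [
        line[len(patient_label):].strip()
        for line in conversation.splitlines()
        if line.startswith(patient_label)
    ]
    if not turns:
        return 0.0
    bare = sum(
        1 for t in turns if re.fullmatch(r"yes[.!,]?", t, flags=re.IGNORECASE)
    )
    return bare / len(turns)


# Synthetic examples for illustration only.
dialogue = (
    "Doctor: Do you have a fever?\n"
    "Patient: Yes.\n"
    "Doctor: Any cough?\n"
    "Patient: Yes, for three days."
)
ratio = bare_yes_ratio(dialogue)
truncated = is_truncated(
    "The patient was admitted",
    "The patient was admitted and later discharged.",
)
```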
## Dataset Structure
Each row of the dataset represents one dialogue-summary-note triplet, and consists of the following dataset fields (all strings):
| Field | Description | Source |
|-|-|-|
| `idx` | Unique identifier, index in the original NoteChat-ChatGPT dataset | NoteChat |
| `note` | Clinical note used by NoteChat (possibly truncated) | NoteChat |
| `full_note` | Full clinical note | PMC-Patients |
| `conversation` | Patient-doctor dialogue | NoteChat |
| `summary`| Patient information summary (JSON) | ours |
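Because every field is stored as a string, the `summary` field has to be decoded a second time to obtain a structured object. The sketch below parses one line of a JSONL file such as `augmented_notes_30K.jsonl` using only the standard library. The example row is synthetic, its summary key is a placeholder rather than a real template field, and `parse_row` is a hypothetical helper that assumes the summary is valid JSON.

```python
import json


def parse_row(line):
    """Parse one JSONL line and decode the JSON-encoded `summary` field."""
    row = json.loads(line)
    row["summary"] = json.loads(row["summary"])
    return row


# Synthetic example row mirroring the five documented fields.
example_line = json.dumps({
    "idx": "0",
    "note": "Truncated note",
    "full_note": "Truncated note plus the rest.",
    "conversation": "Doctor: Hello.\nPatient: Hi.",
    "summary": json.dumps({"symptoms": "chest pain"}),
})
row = parse_row(example_line)
```

In practice the rows could also be loaded with the Hugging Face `datasets` library, for example `load_dataset("AGBonnet/augmented-clinical-notes", split="train")`, before applying the same decoding step to each `summary`.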
## Uses
While this dataset was originally used to fine-tune LLMs to extract structured patient information from dialogue, it can also be used for diverse applications in the healthcare domain, such as training models to extract comprehensive tabular patient features from clinical notes.
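The two uses mentioned above differ only in which field serves as the model input. A minimal sketch of turning a row into an input/target pair for supervised fine-tuning could look as follows. The `input`/`target` key names and the `to_training_pair` helper are assumptions, not a format prescribed by the authors.

```python
def to_training_pair(row, source="conversation", target="summary"):
    """Map a dataset row to an input/target pair for supervised fine-tuning."""
    return {"input": row[source], "target": row[target]}


# Synthetic row for illustration only.
row = {
    "idx": "0",
    "note": "Truncated note",
    "full_note": "Truncated note plus the rest.",
    "conversation": "Doctor: Hello.\nPatient: Hi.",
    "summary": '{"symptoms": "chest pain"}',
}
dialogue_to_summary = to_training_pair(row)              # extraction from dialogue
note_to_summary = to_training_pair(row, source="note")   # extraction from notes
```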
## Bias, Risks, and Limitations
- **Synthetic Data**: NoteChat dialogues were synthetically generated from clinical notes; they are not completely realistic and therefore fail to accurately represent real patient-doctor conversations. Real patient-doctor conversations are of course preferred, but their distribution is forbidden in the US by the [Health Insurance Portability and Accountability Act of 1996](https://www.cdc.gov/phlp/publications/topic/hipaa.html).
- **Representation**: PMC-Patients clinical notes have been extracted from English PubMed Central publications, and therefore over-represent clinical settings from English-speaking countries.
## Acknowledgments
We thank Prof. Mary-Anne Hartley for her advice on the appropriate template for structured medical patient summaries.
<!--
## Citation
If you use the Augmented Clinical Notes dataset, please cite our work:
```
ADD CITATION
```
-->
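Provider:
AGBonnet
Summary of original information
Augmented Clinical Notes dataset
Dataset overview
- Name: Augmented Clinical Notes
- License: MIT
- Task category: text generation
- Language: English
- Size: 10K<n<100K
- Configs:
  - default: 30,000 records, file path `augmented_notes_30K.jsonl`
- Tags: medical, health
Dataset details
- Features:
  - idx: string, unique identifier
  - note: string, clinical note used by NoteChat (possibly truncated)
  - full_note: string, full clinical note
  - conversation: string, patient-doctor dialogue
  - summary: string, patient information summary (JSON)
Dataset sources
- Clinical notes: from PMC-Patients, which contains 167K patient summaries extracted from open-access case studies in PubMed Central.
- Synthetic dialogues: synthetic patient-doctor conversations generated from clinical notes with GPT-3.5, based on NoteChat.
- Structured patient information: structured patient summaries extracted from clinical notes, generated with GPT-4 and a tailored medical information template.
Dataset creation
- Clinical notes: the primary source is PMC-Patients.
- Synthetic dialogues: based on NoteChat, generated with ChatGPT.
- Patient information: structured patient information extracted from the 30K longest clinical notes, using GPT-4 with zero-shot instructions.
Dataset structure
- Fields:
  - idx: unique identifier
  - note: clinical note used by NoteChat
  - full_note: full clinical note
  - conversation: patient-doctor dialogue
  - summary: patient information summary (JSON)
Dataset uses
- Fine-tuning large language models (LLMs) to extract structured patient information from dialogue; also usable for other healthcare applications, such as extracting comprehensive tabular patient features from clinical notes.
Bias, risks, and limitations
- Synthetic data: NoteChat dialogues are synthetically generated, are not completely realistic, and do not accurately represent real patient-doctor conversations.
- Representation: PMC-Patients clinical notes come from English PubMed Central publications and therefore over-represent clinical settings in English-speaking countries.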



