abuhoraira06/Open-Patients
Hugging Face. Updated 2026-04-12; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/abuhoraira06/Open-Patients
Description:
---
license: cc-by-sa-4.0
tags:
- medical
---
Open-Patients is an aggregated dataset of public patient notes drawn from four open-source datasets.
There are a total of 180,142 patient descriptions from these four datasets. These descriptions are all provided in the `Open-Patients.jsonl` file. For each item in the dataset, there are two attributes:
1. `_id` - identifies the dataset an item came from, together with the item's index number within that dataset.
2. `description` - the exact patient note extracted from the source public dataset of patient notes.
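As a minimal sketch of reading this format, assuming each line of `Open-Patients.jsonl` is a single JSON object carrying the two attributes above. The helper name `iter_patients` and the sample records are illustrative only and are not part of the dataset:

```python
import io
import json
from typing import Iterator, TextIO


def iter_patients(handle: TextIO) -> Iterator[dict]:
    """Yield one record per non-empty JSONL line, checking both attributes exist."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = {"_id", "description"} - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing {sorted(missing)}")
        yield record


# Made-up records standing in for the real file (assumption: same two keys).
sample = io.StringIO(
    '{"_id": "trec-cds-2014-1", "description": "Placeholder note text."}\n'
    '{"_id": "usmle-0", "description": "Placeholder question text."}\n'
)
records = list(iter_patients(sample))
```

With the real file, the same generator can be fed from `open("Open-Patients.jsonl", encoding="utf-8")`; streaming line by line avoids holding all 180,142 descriptions in memory at once.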
The patient notes and questions come from the following four datasets:
1. `Text REtrieval Conference (TREC) Clinical Decision Support (CDS) track`. This track provides 30 patient notes for each of
three years, 2014 through 2016. The track challenged participants to retrieve relevant articles that
can help answer potential questions about a given patient note. The [2014](https://www.trec-cds.org/2014.html) and [2015](https://www.trec-cds.org/2015.html) patient notes are synthetic, hand-written
by individuals with medical training, whereas the [2016](https://www.trec-cds.org/2016.html) dataset consists of real patient summaries taken from electronic health records.
The `_id` for these notes has the following structure: `trec-cds-{year}-{note number}`, where year is between 2014 and 2016
and 'note number' is the index number of the note within the dataset for that year.
2. `Text REtrieval Conference (TREC) Clinical Trials (CT) track`. This track consists of 125 patient notes, of which [50 notes](https://www.trec-cds.org/2021.html) are from
2021 and [75 notes](https://www.trec-cds.org/2022.html) are from 2022. The track asked participants to retrieve the clinical trials from
ClinicalTrials.gov that best match the symptoms described in the patient note. The notes from both years are synthetic, written by individuals with medical training
to simulate an admission statement from an electronic health record (EHR). The `_id` for these notes has the following
structure: `trec-ct-{year}-{note number}`, where year is either 2021 or 2022 and 'note number' is the index number of the note within the
dataset for that year.
3. `MedQA-USMLE (US track)`. This [dataset](https://paperswithcode.com/dataset/medqa-usmle) consists of 14,369 multiple-choice questions from the United States Medical Licensing Examination (USMLE),
in which a clinical summary of a patient is given and a question is asked based on the information provided. Because not all of the questions involve a patient case, we keep only those
that do, which leaves 12,893 questions from this dataset. These questions were curated as part of the MedQA dataset for examining methods that retrieve relevant documents and
use them to augment language models in solving a question. The `_id` for these notes has the following format: `usmle-{question index number}`, where 'question index number' is the index of the question
in the `US_qbank.jsonl` file of the MedQA dataset, which contains all USMLE questions.
4. `PMC-Patients`. This [dataset](https://pmc-patients.github.io/) consists of 167,034 patient notes that were curated from PubMed Central (PMC). The purpose of this dataset is to
benchmark the performance of different Retrieval-based Clinical Decision Support Systems (ReCDS). For a given patient note, this dataset evaluates a model's
ability to find similar patient notes and relevant articles from PMC. The `_id` for these notes has the following format: `pmc-{patient id}`,
where 'patient id' is the `patient_uid` attribute of each patient note in the `pmc-patients.json` file of the PMC-Patients dataset.
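The four `_id` formats above can be told apart by their prefix. The following is a sketch under stated assumptions rather than code shipped with the dataset: the names `ParsedId` and `parse_id` are hypothetical, the year is assumed to be four digits, and the trailing part of each `_id` is treated as an opaque string because the exact shape of `patient_uid` is not specified here. The per-source counts are those quoted in the list.

```python
import re
from typing import NamedTuple, Optional


class ParsedId(NamedTuple):
    source: str           # "trec-cds", "trec-ct", "usmle" or "pmc"
    year: Optional[int]   # only the TREC sources carry a year
    local_id: str         # note number, question index, or patient id


# Order matters only for readability; the prefixes do not overlap.
_PATTERNS = [
    ("trec-cds", re.compile(r"^trec-cds-(?P<year>\d{4})-(?P<local>.+)$")),
    ("trec-ct", re.compile(r"^trec-ct-(?P<year>\d{4})-(?P<local>.+)$")),
    ("usmle", re.compile(r"^usmle-(?P<local>.+)$")),
    ("pmc", re.compile(r"^pmc-(?P<local>.+)$")),
]


def parse_id(raw_id: str) -> ParsedId:
    """Split an `_id` into its source, optional year, and local identifier."""
    for source, pattern in _PATTERNS:
        match = pattern.match(raw_id)
        if match:
            groups = match.groupdict()
            year = int(groups["year"]) if "year" in groups else None
            return ParsedId(source, year, groups["local"])
    raise ValueError(f"unrecognised _id: {raw_id!r}")


# Per-source counts as stated in the list above.
EXPECTED_COUNTS = {
    "trec-cds": 3 * 30,   # 30 notes for each of 2014, 2015, 2016
    "trec-ct": 50 + 75,   # 2021 and 2022
    "usmle": 12_893,      # patient-related questions kept after filtering
    "pmc": 167_034,
}
# The four sources add up to the stated total of 180,142 descriptions.
assert sum(EXPECTED_COUNTS.values()) == 180_142
```

Grouping records by `parse_id(record["_id"]).source` and comparing the tallies against `EXPECTED_COUNTS` gives a quick sanity check on a downloaded copy.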
We hope this dataset of patient summaries and medical examination questions will help researchers benchmark the performance
of large language models (LLMs) on medical entity extraction, as well as LLMs' performance in using the extracted entities
to perform different medical calculations.
If you find this dataset useful, please cite our paper:
```bibtex
@article{khandekar2024medcalc,
title={Medcalc-bench: Evaluating large language models for medical calculations},
author={Khandekar, Nikhil and Jin, Qiao and Xiong, Guangzhi and Dunn, Soren and Applebaum, Serina and Anwar, Zain and Sarfo-Gyamfi, Maame and Safranek, Conrad and Anwar, Abid and Zhang, Andrew and others},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={84730--84745},
year={2024}
}
```
Provided by:
abuhoraira06



