vidulpanickan/TinyEHR
Source: Hugging Face | Updated: 2026-04-03 | Indexed: 2026-04-12
Download link: https://hf-mirror.com/datasets/vidulpanickan/TinyEHR
Description:
---
license: odbl
language:
- en
size_categories:
- 1M<n<10M
multilinguality:
- monolingual
source_datasets:
- physionet/mimic-iv-demo
task_categories:
- table-question-answering
tags:
- medical
- clinical
- ehr
- mimic
- omop
- healthcare
- agentic
- clinical-notes
- clinical-nlp
- tabular
pretty_name: TinyEHR
configs:
- config_name: mimic_admissions
  data_files: "tinyehr_mimic_format/admissions.parquet"
- config_name: mimic_caregiver
  data_files: "tinyehr_mimic_format/caregiver.parquet"
- config_name: mimic_chartevents
  data_files: "tinyehr_mimic_format/chartevents.parquet"
- config_name: mimic_d_hcpcs
  data_files: "tinyehr_mimic_format/d_hcpcs.parquet"
- config_name: mimic_d_icd_diagnoses
  data_files: "tinyehr_mimic_format/d_icd_diagnoses.parquet"
- config_name: mimic_d_icd_procedures
  data_files: "tinyehr_mimic_format/d_icd_procedures.parquet"
- config_name: mimic_d_items
  data_files: "tinyehr_mimic_format/d_items.parquet"
- config_name: mimic_d_labitems
  data_files: "tinyehr_mimic_format/d_labitems.parquet"
- config_name: mimic_date_offsets
  data_files: "tinyehr_mimic_format/date_offsets.parquet"
- config_name: mimic_datetimeevents
  data_files: "tinyehr_mimic_format/datetimeevents.parquet"
- config_name: mimic_diagnoses_icd
  data_files: "tinyehr_mimic_format/diagnoses_icd.parquet"
- config_name: mimic_drgcodes
  data_files: "tinyehr_mimic_format/drgcodes.parquet"
- config_name: mimic_emar
  data_files: "tinyehr_mimic_format/emar.parquet"
- config_name: mimic_emar_detail
  data_files: "tinyehr_mimic_format/emar_detail.parquet"
- config_name: mimic_hcpcsevents
  data_files: "tinyehr_mimic_format/hcpcsevents.parquet"
- config_name: mimic_icustays
  data_files: "tinyehr_mimic_format/icustays.parquet"
- config_name: mimic_ingredientevents
  data_files: "tinyehr_mimic_format/ingredientevents.parquet"
- config_name: mimic_inputevents
  data_files: "tinyehr_mimic_format/inputevents.parquet"
- config_name: mimic_labevents
  data_files: "tinyehr_mimic_format/labevents.parquet"
- config_name: mimic_microbiologyevents
  data_files: "tinyehr_mimic_format/microbiologyevents.parquet"
- config_name: mimic_noteevents
  data_files: "tinyehr_mimic_format/noteevents.parquet"
- config_name: mimic_omr
  data_files: "tinyehr_mimic_format/omr.parquet"
- config_name: mimic_outputevents
  data_files: "tinyehr_mimic_format/outputevents.parquet"
- config_name: mimic_patients
  data_files: "tinyehr_mimic_format/patients.parquet"
- config_name: mimic_pharmacy
  data_files: "tinyehr_mimic_format/pharmacy.parquet"
- config_name: mimic_poe
  data_files: "tinyehr_mimic_format/poe.parquet"
- config_name: mimic_poe_detail
  data_files: "tinyehr_mimic_format/poe_detail.parquet"
- config_name: mimic_prescriptions
  data_files: "tinyehr_mimic_format/prescriptions.parquet"
- config_name: mimic_procedureevents
  data_files: "tinyehr_mimic_format/procedureevents.parquet"
- config_name: mimic_procedures_icd
  data_files: "tinyehr_mimic_format/procedures_icd.parquet"
- config_name: mimic_provider
  data_files: "tinyehr_mimic_format/provider.parquet"
- config_name: mimic_services
  data_files: "tinyehr_mimic_format/services.parquet"
- config_name: mimic_transfers
  data_files: "tinyehr_mimic_format/transfers.parquet"
- config_name: omop_2b_concept
  data_files: "tinyehr_omop_format/2b_concept.parquet"
- config_name: omop_2b_concept_relationship
  data_files: "tinyehr_omop_format/2b_concept_relationship.parquet"
- config_name: omop_2b_vocabulary
  data_files: "tinyehr_omop_format/2b_vocabulary.parquet"
- config_name: omop_attribute_definition
  data_files: "tinyehr_omop_format/attribute_definition.parquet"
- config_name: omop_care_site
  data_files: "tinyehr_omop_format/care_site.parquet"
- config_name: omop_cdm_source
  data_files: "tinyehr_omop_format/cdm_source.parquet"
- config_name: omop_cohort
  data_files: "tinyehr_omop_format/cohort.parquet"
- config_name: omop_cohort_attribute
  data_files: "tinyehr_omop_format/cohort_attribute.parquet"
- config_name: omop_cohort_definition
  data_files: "tinyehr_omop_format/cohort_definition.parquet"
- config_name: omop_condition_era
  data_files: "tinyehr_omop_format/condition_era.parquet"
- config_name: omop_condition_occurrence
  data_files: "tinyehr_omop_format/condition_occurrence.parquet"
- config_name: omop_cost
  data_files: "tinyehr_omop_format/cost.parquet"
- config_name: omop_death
  data_files: "tinyehr_omop_format/death.parquet"
- config_name: omop_device_exposure
  data_files: "tinyehr_omop_format/device_exposure.parquet"
- config_name: omop_dose_era
  data_files: "tinyehr_omop_format/dose_era.parquet"
- config_name: omop_drug_era
  data_files: "tinyehr_omop_format/drug_era.parquet"
- config_name: omop_drug_exposure
  data_files: "tinyehr_omop_format/drug_exposure.parquet"
- config_name: omop_fact_relationship
  data_files: "tinyehr_omop_format/fact_relationship.parquet"
- config_name: omop_location
  data_files: "tinyehr_omop_format/location.parquet"
- config_name: omop_measurement
  data_files: "tinyehr_omop_format/measurement.parquet"
- config_name: omop_metadata
  data_files: "tinyehr_omop_format/metadata.parquet"
- config_name: omop_note
  data_files: "tinyehr_omop_format/note.parquet"
- config_name: omop_note_nlp
  data_files: "tinyehr_omop_format/note_nlp.parquet"
- config_name: omop_observation
  data_files: "tinyehr_omop_format/observation.parquet"
- config_name: omop_observation_period
  data_files: "tinyehr_omop_format/observation_period.parquet"
- config_name: omop_payer_plan_period
  data_files: "tinyehr_omop_format/payer_plan_period.parquet"
- config_name: omop_person
  data_files: "tinyehr_omop_format/person.parquet"
- config_name: omop_procedure_occurrence
  data_files: "tinyehr_omop_format/procedure_occurrence.parquet"
- config_name: omop_provider
  data_files: "tinyehr_omop_format/provider.parquet"
- config_name: omop_specimen
  data_files: "tinyehr_omop_format/specimen.parquet"
- config_name: omop_visit_detail
  data_files: "tinyehr_omop_format/visit_detail.parquet"
- config_name: omop_visit_occurrence
  data_files: "tinyehr_omop_format/visit_occurrence.parquet"
---
# TinyEHR
**v0.2.0** | [GitHub](https://github.com/vidulpanickan/TinyEHR) | [Website](https://tinyehr.org) | [PyPI](https://pypi.org/project/tinyehr/)
A 100-patient dataset of electronic health records (EHR), built for learning, experimenting, and prototyping healthcare data tools and agentic AI systems. Working with real healthcare data typically requires credentialing and data access agreements. TinyEHR is free to use.
> **This dataset is for learning, prototyping, and exploration only. It should not be used for clinical analysis, medical decision-making, or patient care.**
This dataset is derived from real EHR data from Beth Israel Deaconess Medical Center (BIDMC) in Boston, US. The data has been de-identified: information that could identify a patient, such as names, medical record numbers, and addresses, has been removed to protect patient privacy. The dataset contains no protected health information (PHI).
| Stat | Value |
|------|-------|
| Patients | 100 |
| Hospital admissions | 275 |
| ICU stays | 140 |
| Clinical notes | 4,580 |
| Gender | 43 F / 57 M |
| Date range | 2011 - 2022 |
| Tables (MIMIC) | 33 |
| Tables (OMOP) | 32 |
**Browse the dataset**: explore 30+ tables, column definitions, and relationships across MIMIC-IV and OMOP formats.
**AI assisted SQL**: ask your queries in plain English.
## Quick Start
```python
from datasets import load_dataset

# Each config is one table; the second argument is the config name.
patients = load_dataset("vidulpanickan/TinyEHR", "mimic_patients")
admissions = load_dataset("vidulpanickan/TinyEHR", "mimic_admissions")
notes = load_dataset("vidulpanickan/TinyEHR", "mimic_noteevents")
```
Also available as a Python package: `pip install tinyehr` ([PyPI](https://pypi.org/project/tinyehr/))
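Iterating over a loaded split yields one plain `dict` per row, so simple aggregations need nothing beyond the standard library. A minimal sketch, assuming the `admissions` table carries a `subject_id` column as in the MIMIC-IV schema; the helper name and the sample rows are hypothetical stand-ins for the loaded table:

```python
from collections import Counter


def admissions_per_patient(rows):
    """Count admissions per patient from an iterable of row dicts."""
    return Counter(row["subject_id"] for row in rows)


# Hypothetical rows standing in for the loaded admissions table;
# these are not real dataset values.
sample_rows = [
    {"subject_id": 1, "hadm_id": 101},
    {"subject_id": 1, "hadm_id": 102},
    {"subject_id": 2, "hadm_id": 201},
]

counts = admissions_per_patient(sample_rows)
print(counts.most_common(1))  # [(1, 2)]
```

The same function should accept any iterable of rows that expose `subject_id`, including a split returned by `load_dataset`.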
## What does the data look like?
**patients** (`subject_id` = patient ID, `anchor_age` = age at anchor year, `dod` = date of death):
```json
{
"subject_id": 10014729,
"gender": "F",
"anchor_age": 21,
"anchor_year": 2013,
"anchor_year_group": "2011 - 2013",
"dod": null
}
```
**noteevents** (`hadm_id` = hospital admission ID, `note_type` = type of clinical note):
```json
{
"note_id": "10014729-DS-0001",
"subject_id": 10014729,
"hadm_id": 23300884,
"note_type": "Discharge summary",
"chartdate": "2013-03-19",
"text": "Admission Date: 2013-03-19 Discharge Date: 2013-03-28\n\nDOB: 1992 Sex: F\n\nService: VSURG → CSURG\n\nAttending: Dr. Katriel Silvane\n\nALLERGIES: NKDA\n\nCC: Postop wound infection s/p thoracotomy..."
}
```
There are 30+ tables covering admissions, diagnoses, lab results, medications, procedures, vitals, clinical notes, and more. Explore all tables at [tinyehr.org](https://tinyehr.org).
## Two Formats
| Format | Tables | Rows | Best for |
|--------|--------|------|----------|
| `tinyehr_mimic_format` | 33 | ~1.4M | Learning how hospital data works |
| `tinyehr_omop_format` | 32 | ~472K | Building tools that work across health systems |
**MIMIC-IV format** follows the original MIMIC-IV schema. If you're new to EHR data, start here.
**OMOP CDM v5.3.1 format** reorganizes the same data into the OMOP Common Data Model, a shared schema in which diagnoses, labs, and medications are mapped to standardized medical vocabularies.
Full details: [ABOUT_THE_DATA.md](https://github.com/vidulpanickan/TinyEHR/blob/main/ABOUT_THE_DATA.md)
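The config names in the dataset card follow a regular pattern: `mimic_<table>` points at `tinyehr_mimic_format/<table>.parquet`, and `omop_<table>` points at `tinyehr_omop_format/<table>.parquet`. The helper below is a small sketch of that mapping; the function names are hypothetical and not part of the `tinyehr` package:

```python
# Folder that holds the parquet files for each format prefix.
FORMAT_DIRS = {
    "mimic": "tinyehr_mimic_format",
    "omop": "tinyehr_omop_format",
}


def config_name(fmt, table):
    """Build the config name for a table, e.g. ("mimic", "patients")."""
    if fmt not in FORMAT_DIRS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"{fmt}_{table}"


def data_file(fmt, table):
    """Build the parquet path that the config of the same name points at."""
    if fmt not in FORMAT_DIRS:
        raise ValueError(f"unknown format: {fmt!r}")
    return f"{FORMAT_DIRS[fmt]}/{table}.parquet"


print(config_name("mimic", "patients"))  # mimic_patients
print(data_file("omop", "person"))       # tinyehr_omop_format/person.parquet
```

The result of `config_name` can be passed as the second argument to `load_dataset`, which keeps format switching in one place when a tool has to support both schemas.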
## Usage
- Build and test AI agents that query, reason over, and navigate real hospital data
- Prototype clinical NLP and text-to-SQL systems against realistic clinical notes and multi-table schemas
- Learn how EHR data is structured across MIMIC-IV and OMOP formats
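For text-to-SQL prototyping, one option is to copy a table into an in-memory SQLite database and run generated queries against it. This is a sketch under assumptions, not the project's own tooling: `load_rows` is a hypothetical helper, only a few `patients` columns are included, and the single row reuses the sample `patients` record from this card.

```python
import sqlite3


def load_rows(conn, table, rows):
    """Create `table` from a list of same-keyed dicts and insert every row."""
    columns = list(rows[0])
    col_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({col_sql}) VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )


conn = sqlite3.connect(":memory:")
load_rows(conn, "patients", [
    {"subject_id": 10014729, "gender": "F", "anchor_age": 21, "anchor_year": 2013},
])

# A query a text-to-SQL agent might produce for "how many patients per gender?"
result = conn.execute(
    "SELECT gender, COUNT(*) FROM patients GROUP BY gender"
).fetchall()
print(result)  # [('F', 1)]
```

An in-memory database keeps each experiment isolated, so a badly formed generated query cannot damage the source files.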
## Known Limitations
- **100 patients only**: this is a learning and prototyping dataset, not statistically representative of any population
- **Clinical notes are generated and not validated**: the notes were generated using Anthropic's Claude Opus 4.6, grounded in each patient's structured data during their hospital visit. They have not been validated by clinicians and may contain hallucinated or inaccurate clinical details (e.g., incorrect ages, fabricated findings, inconsistent timelines). They should not be treated as clinically accurate
- **Single institution**: all data comes from one US academic medical center (Beth Israel Deaconess Medical Center in Boston), so demographics and clinical patterns reflect this specific patient population
- **OMOP vocabulary subset**: the OMOP format uses a subset of the full OHDSI Athena vocabulary, limited to the concepts needed for these 100 patients
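Because the generated notes may contain incorrect ages, a cheap consistency check against the structured data can flag suspicious notes. The sketch below assumes a note carries a `DOB: YYYY` field, as the sample discharge summary does, and approximates the birth year as `anchor_year - anchor_age`; both function names and the one-year tolerance are hypothetical choices.

```python
import re

# Matches a four-digit year after "DOB:", as in the sample note header.
DOB_PATTERN = re.compile(r"DOB:\s*(\d{4})")


def approx_birth_year(patient):
    """Approximate birth year from the structured anchor fields."""
    return patient["anchor_year"] - patient["anchor_age"]


def dob_matches(patient, note_text, tolerance=1):
    """Return True or False if the note states a DOB year, None if it does not."""
    match = DOB_PATTERN.search(note_text)
    if match is None:
        return None
    return abs(int(match.group(1)) - approx_birth_year(patient)) <= tolerance


# Values taken from the sample patients and noteevents records in this card.
patient = {"subject_id": 10014729, "anchor_age": 21, "anchor_year": 2013}
note_text = "Admission Date: 2013-03-19 Discharge Date: 2013-03-28\n\nDOB: 1992 Sex: F"

print(dob_matches(patient, note_text))  # True
```

A `None` result means the note has no parsable DOB field, which is different from a mismatch and is worth counting separately.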
## Roadmap
- Synthetic clinical notes authored by clinicians (currently generated by an LLM)
- Additional data modalities including medical imaging (X-ray, CT scan)
## Citation
If you use TinyEHR in your work, please cite:
```bibtex
@misc{tinyehr2026,
title={TinyEHR: A 100 Patient Electronic Health Records Dataset for Learning and Prototyping Agentic AI},
author={Vidul Ayakulangara Panickan},
year={2026},
url={https://github.com/vidulpanickan/TinyEHR}
}
```
## Source Citations
1. Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV, a freely accessible electronic health record dataset. *Scientific Data*, 10(1), 1. https://doi.org/10.1038/s41597-022-01899-x
2. Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV Clinical Database Demo (version 2.2). *PhysioNet*. https://doi.org/10.13026/dp1f-ex47
3. Kallfelz, M., Tsvetkova, A., Pollard, T., Kwong, M., Lipori, G., Huser, V., Osborn, J., Hao, S., & Williams, A. (2021). MIMIC-IV Demo Data in the OMOP Common Data Model (version 0.9). *PhysioNet*. https://doi.org/10.13026/p1f5-7x35
## License
[ODbL-1.0](https://opendatacommons.org/licenses/odbl/1-0/) (Open Data Commons Open Database License). Free to use, share, and modify. Redistributed versions must use the same license.
Provider: vidulpanickan



