richardyoung/synthea-575k-patients
Source: Hugging Face · Updated 2026-03-23 · Indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/richardyoung/synthea-575k-patients
Resource description:
---
license: mit
task_categories:
- text-generation
- tabular-classification
- tabular-regression
language:
- en
tags:
- healthcare
- synthetic-data
- medical
- clinical
- ehr
- electronic-health-records
- synthea
- education
- tutorial
- beginner-friendly
size_categories:
- 100K<n<1M
---
# Synthea Synthetic Patient Records (575K Patients)
A synthetic healthcare dataset containing **575,415 patients** with complete medical histories, generated using [Synthea](https://github.com/synthetichealth/synthea), an open-source patient generator widely used for synthetic EHR data.
**No real patient data.** Fully synthetic, HIPAA-safe, and ready for ML research and education.
## Why This Dataset?
- **575K patients** with realistic demographics, conditions, medications, and encounters
- **Privacy-safe**: No real PHI — use freely in research, teaching, and production
- **Complete records**: Conditions, medications, procedures, encounters, observations, and more
- **Parquet format**: Fast loading with pandas, polars, or HF datasets
## Quick Start
```python
from datasets import load_dataset

# Load the default configuration of the dataset
ds = load_dataset("richardyoung/synthea-575k-patients")

# Or load a specific table by its Parquet file
patients = load_dataset(
    "richardyoung/synthea-575k-patients",
    data_files="patients.parquet",
)

# Explore the patients table (a single data file loads into the "train" split)
print(f"Patients: {len(patients['train']):,}")
print(patients["train"].column_names)
print(patients["train"][0])
```
### With pandas
```python
import pandas as pd
from huggingface_hub import hf_hub_download

# Download one table from the dataset repo into the local Hugging Face cache
path = hf_hub_download(
    repo_id="richardyoung/synthea-575k-patients",
    filename="patients.parquet",
    repo_type="dataset",
)

df = pd.read_parquet(path)
print(df.head())
print(f"Shape: {df.shape}")
```
## Dataset Structure
| Table | Description | Key Fields |
|---|---|---|
| patients | Patient demographics | birthdate, gender, race, ethnicity, city, state |
| conditions | Diagnoses/conditions | code, description, start/stop dates |
| medications | Prescriptions | code, description, start/stop, reason |
| encounters | Clinical visits | type, code, description, cost |
| procedures | Medical procedures | code, description, cost |
| observations | Lab results & vitals | code, description, value, units |
| allergies | Patient allergies | code, description, type |
| immunizations | Vaccination records | code, description, date |
| careplans | Treatment plans | code, description, reason |
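The tables are relational: each row in a clinical table refers back to one patient. Below is a minimal sketch of joining conditions to patient demographics with pandas. The column names (`Id` in patients, `PATIENT` and `DESCRIPTION` in conditions, `GENDER`, `BIRTHDATE`) follow Synthea's standard CSV export and are assumptions here, since the Parquet files may use different names or casing; check `df.columns` and adjust the arguments. The helper `conditions_with_demographics` is hypothetical, and the small in-memory frames are made-up stand-ins for the real tables.

```python
import pandas as pd


def conditions_with_demographics(
    patients: pd.DataFrame,
    conditions: pd.DataFrame,
    patient_key: str = "Id",        # assumed primary key column in patients
    condition_fk: str = "PATIENT",  # assumed foreign key column in conditions
) -> pd.DataFrame:
    """Attach patient demographics to each condition row (left join)."""
    return conditions.merge(
        patients,
        how="left",
        left_on=condition_fk,
        right_on=patient_key,
        validate="many_to_one",  # raises if patient ids are duplicated
    )


# Made-up stand-in rows, not taken from the dataset.
patients_demo = pd.DataFrame(
    {
        "Id": ["p1", "p2"],
        "GENDER": ["F", "M"],
        "BIRTHDATE": ["1980-05-01", "1955-11-23"],
    }
)
conditions_demo = pd.DataFrame(
    {
        "PATIENT": ["p1", "p1", "p2"],
        "DESCRIPTION": ["Hypertension", "Prediabetes", "Hypertension"],
    }
)

joined = conditions_with_demographics(patients_demo, conditions_demo)

# Count distinct patients per condition, split by gender.
counts = joined.groupby(["DESCRIPTION", "GENDER"])["PATIENT"].nunique()
print(counts)
```

The same pattern should apply to the other clinical tables, provided they carry a patient reference column. For the full tables, passing `columns=[...]` to `pd.read_parquet` before joining keeps memory use down.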
## Use Cases
- **ML training**: Build classifiers for disease prediction, readmission risk, mortality
- **NLP**: Train models on clinical text and medical terminology
- **Education**: Teach healthcare data science without privacy concerns
- **Benchmarking**: Standardized dataset for comparing healthcare ML approaches
- **RAG systems**: Build medical Q&A systems with realistic clinical data
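For the ML training use case, a common first step is a one-row-per-patient table with features and a label. The sketch below derives an approximate age and a binary "has this condition" label. The column names (`Id`, `BIRTHDATE`, `PATIENT`, `DESCRIPTION`) are assumed from Synthea's standard CSV export, the helper `build_label_table` and the `as_of` reference date are hypothetical, and the demo rows are made up rather than drawn from the dataset.

```python
import pandas as pd


def build_label_table(
    patients: pd.DataFrame,
    conditions: pd.DataFrame,
    target_description: str,
    as_of: str = "2025-01-01",  # hypothetical reference date for computing age
) -> pd.DataFrame:
    """Return one row per patient with an approximate age and a 0/1 label."""
    out = patients[["Id"]].copy()

    # Approximate age in whole years at the reference date.
    birth = pd.to_datetime(patients["BIRTHDATE"])
    age_days = (pd.Timestamp(as_of) - birth).dt.days
    out["age"] = (age_days // 365.25).astype(int)

    # Label is 1 if the patient has at least one row with the target condition.
    has_target = conditions.loc[
        conditions["DESCRIPTION"] == target_description, "PATIENT"
    ].unique()
    out["label"] = out["Id"].isin(has_target).astype(int)
    return out


# Made-up stand-in rows, not taken from the dataset.
patients_demo = pd.DataFrame(
    {"Id": ["p1", "p2"], "BIRTHDATE": ["1980-05-01", "1955-11-23"]}
)
conditions_demo = pd.DataFrame(
    {
        "PATIENT": ["p1", "p1", "p2"],
        "DESCRIPTION": ["Hypertension", "Prediabetes", "Hypertension"],
    }
)

table = build_label_table(patients_demo, conditions_demo, "Prediabetes")
print(table)
```

When training on a table like this, split by patient rather than by row so that records from the same patient never land in both the training and test sets.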
## Related Work
This dataset supports the **CardioEmbed** research project — domain-adapted embeddings for cardiology:
- [CardioEmbed (Qwen3-8B)](https://hf.co/richardyoung/CardioEmbed)
- [CardioEmbed-BioLinkBERT](https://hf.co/richardyoung/CardioEmbed-BioLinkBERT)
## Citation
If you use this dataset, please cite Synthea:
```bibtex
@article{walonoski2018synthea,
  title={Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record},
  author={Walonoski, Jason and others},
  journal={Journal of the American Medical Informatics Association},
  year={2018}
}
```
## Other Work by richardyoung
- **Abliterated/Uncensored models**: [Qwen2.5-7B](https://hf.co/richardyoung/Qwen2.5-7B-Instruct-abliterated-GGUF) | [Qwen3-14B](https://hf.co/richardyoung/Qwen3-14B-abliterated-GGUF) | [DeepSeek-R1-32B](https://hf.co/richardyoung/Deepseek-R1-Distill-Qwen-32b-uncensored) | [Qwen3-8B](https://hf.co/richardyoung/Qwen3-8B-Abliterated)
- **MLX quantizations (Apple Silicon)**: [Kimi-K2 series](https://hf.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | [olmOCR MLX](https://hf.co/richardyoung/olmOCR-2-7B-1025-MLX-4bit)
- **OCR & Vision**: [olmOCR GGUF](https://hf.co/richardyoung/olmOCR-2-7B-1025-GGUF)
- **Healthcare/Medical**: [Synthea 575K patients dataset](https://hf.co/datasets/richardyoung/synthea-575k-patients) | [CardioEmbed](https://hf.co/richardyoung/CardioEmbed)
- **Research**: [LLM Instruction-Following Evaluation](https://hf.co/richardyoung/llm-instruction-following-paper) (arxiv:2510.18892)
Provider:
richardyoung



