
richardyoung/synthea-575k-patients

Source: Hugging Face | Updated: 2026-03-23 | Indexed: 2025-12-20
Download link:
https://hf-mirror.com/datasets/richardyoung/synthea-575k-patients
Description:
---
license: mit
task_categories:
- text-generation
- tabular-classification
- tabular-regression
language:
- en
tags:
- healthcare
- synthetic-data
- medical
- clinical
- ehr
- electronic-health-records
- synthea
- education
- tutorial
- beginner-friendly
size_categories:
- 100K<n<1M
---

# Synthea Synthetic Patient Records (575K Patients)

A comprehensive synthetic healthcare dataset containing **575,415 patients** with complete medical histories, generated using [Synthea](https://github.com/synthetichealth/synthea), the gold standard for synthetic EHR data.

**No real patient data.** Fully synthetic, HIPAA-safe, and ready for ML research and education.

## Why This Dataset?

- **575K patients** with realistic demographics, conditions, medications, and encounters
- **Privacy-safe**: No real PHI, so it can be used freely in research, teaching, and production
- **Complete records**: Conditions, medications, procedures, encounters, observations, and more
- **Parquet format**: Fast loading with pandas, polars, or HF datasets

## Quick Start

```python
from datasets import load_dataset

# Load the full dataset
ds = load_dataset("richardyoung/synthea-575k-patients")

# Or load a specific table
patients = load_dataset("richardyoung/synthea-575k-patients", data_files="patients.parquet")

# Explore
print(f"Patients: {len(ds['train']):,}")
print(ds['train'].column_names)
print(ds['train'][0])
```

### With pandas

```python
import pandas as pd
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="richardyoung/synthea-575k-patients",
    filename="patients.parquet",
    repo_type="dataset",
)
df = pd.read_parquet(path)
print(df.head())
print(f"Shape: {df.shape}")
```

## Dataset Structure

| Table | Description | Key Fields |
|---|---|---|
| patients | Patient demographics | birthdate, gender, race, ethnicity, city, state |
| conditions | Diagnoses/conditions | code, description, start/stop dates |
| medications | Prescriptions | code, description, start/stop, reason |
| encounters | Clinical visits | type, code, description, cost |
| procedures | Medical procedures | code, description, cost |
| observations | Lab results & vitals | code, description, value, units |
| allergies | Patient allergies | code, description, type |
| immunizations | Vaccination records | code, description, date |
| careplans | Treatment plans | code, description, reason |

## Use Cases

- **ML training**: Build classifiers for disease prediction, readmission risk, mortality
- **NLP**: Train models on clinical text and medical terminology
- **Education**: Teach healthcare data science without privacy concerns
- **Benchmarking**: Standardized dataset for comparing healthcare ML approaches
- **RAG systems**: Build medical Q&A systems with realistic clinical data

## Related Work

This dataset supports the **CardioEmbed** research project (domain-adapted embeddings for cardiology):

- [CardioEmbed (Qwen3-8B)](https://hf.co/richardyoung/CardioEmbed)
- [CardioEmbed-BioLinkBERT](https://hf.co/richardyoung/CardioEmbed-BioLinkBERT)

## Citation

If you use this dataset, please cite Synthea:

```bibtex
@article{walonoski2018synthea,
  title={Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record},
  author={Walonoski, Jason and others},
  journal={Journal of the American Medical Informatics Association},
  year={2018}
}
```

## Other Models by richardyoung

- **Abliterated/Uncensored models**: [Qwen2.5-7B](https://hf.co/richardyoung/Qwen2.5-7B-Instruct-abliterated-GGUF) | [Qwen3-14B](https://hf.co/richardyoung/Qwen3-14B-abliterated-GGUF) | [DeepSeek-R1-32B](https://hf.co/richardyoung/Deepseek-R1-Distill-Qwen-32b-uncensored) | [Qwen3-8B](https://hf.co/richardyoung/Qwen3-8B-Abliterated)
- **MLX quantizations (Apple Silicon)**: [Kimi-K2 series](https://hf.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | [olmOCR MLX](https://hf.co/richardyoung/olmOCR-2-7B-1025-MLX-4bit)
- **OCR & Vision**: [olmOCR GGUF](https://hf.co/richardyoung/olmOCR-2-7B-1025-GGUF)
- **Healthcare/Medical**: [Synthea 575K patients dataset](https://hf.co/datasets/richardyoung/synthea-575k-patients) | [CardioEmbed](https://hf.co/richardyoung/CardioEmbed)
- **Research**: [LLM Instruction-Following Evaluation](https://hf.co/richardyoung/llm-instruction-following-paper) (arxiv:2510.18892)
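
The tables listed under Dataset Structure are most useful when combined, for example linking each diagnosis in `conditions` back to the demographics in `patients`. The sketch below shows one way that join might look in pandas. It runs on small in-memory stand-ins rather than the real files, and the column names (`Id`, `PATIENT`, `GENDER`, `DESCRIPTION`) are assumptions borrowed from Synthea's CSV export convention; the parquet files in this repository may name them differently, so check `df.columns` first. The helper `condition_counts_by_gender` is a hypothetical name used only for illustration.

```python
import pandas as pd


def condition_counts_by_gender(
    patients: pd.DataFrame, conditions: pd.DataFrame
) -> pd.DataFrame:
    """Count distinct patients per (condition description, gender).

    Assumed columns (not confirmed for this dataset):
      patients:   Id, GENDER
      conditions: PATIENT (foreign key to patients.Id), DESCRIPTION
    """
    # Attach each patient's gender to every condition row.
    merged = conditions.merge(
        patients[["Id", "GENDER"]],
        left_on="PATIENT",
        right_on="Id",
        how="inner",
    )
    # nunique() so a patient with a repeated diagnosis is counted once.
    return (
        merged.groupby(["DESCRIPTION", "GENDER"])["PATIENT"]
        .nunique()
        .reset_index(name="n_patients")
        .sort_values(
            ["n_patients", "DESCRIPTION", "GENDER"],
            ascending=[False, True, True],
        )
        .reset_index(drop=True)
    )


# Toy stand-ins for patients.parquet and conditions.parquet.
toy_patients = pd.DataFrame(
    {"Id": ["p1", "p2", "p3"], "GENDER": ["F", "M", "F"]}
)
toy_conditions = pd.DataFrame(
    {
        "PATIENT": ["p1", "p1", "p2", "p3", "p3"],
        "DESCRIPTION": [
            "Hypertension",
            "Hypertension",  # repeated diagnosis for the same patient
            "Hypertension",
            "Asthma",
            "Hypertension",
        ],
    }
)

summary = condition_counts_by_gender(toy_patients, toy_conditions)
print(summary)
```

With the real data, the same function could be called on frames loaded via `pd.read_parquet`, after adjusting the column names to whatever the files actually contain.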
Provider:
richardyoung