
Wilhelm-Foundation/rare-archive-synthetic-patients

Source: Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients
Resource description:
---
dataset_info:
  config_name: default
  features:
  - name: patient_id
    dtype: string
  - name: clinical_vignette
    dtype: string
  - name: ground_truth_diagnosis
    dtype: string
  - name: disease_id
    dtype: string
  - name: hpo_terms_present
    sequence: string
  - name: hpo_terms_absent
    sequence: string
  - name: age
    dtype: int64
  - name: sex
    dtype: string
  - name: difficulty
    dtype: string
  - name: patient_category
    dtype: string
  - name: family_history
    dtype: string
  splits:
  - name: train
    num_examples: 12984
tags:
- rare-disease
- clinical-diagnostics
- synthetic_patients
- rare-ai-archive
- training-data
- synthetic
task_categories:
- text-generation
- question-answering
language:
- en
size_categories:
- 10K<n<100K
license: cc-by-nc-sa-4.0
---

# Rare Archive Synthetic Patients — SFT Training Data

12,984 synthetic rare disease patient vignettes generated from Orphanet disease profiles. Designed for supervised fine-tuning (SFT) of diagnostic AI models. Part of the [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive).

> **All patients are computationally generated. Zero real patient data. Zero PHI.**
> This dataset contains no Protected Health Information. Every vignette is synthetically generated from public Orphanet disease profiles using frequency-weighted phenotype sampling. No real patients were involved in any stage of data creation.

> **Research use only.** This dataset is training data for AI research. It is NOT intended for clinical decision-making.

## Dataset Description

- **Repository:** [Wilhelm-Foundation/rare-archive-synthetic-patients](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients)
- **License:** CC BY-NC-SA 4.0
- **Version:** 0.1.0
- **Part of:** [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive) · [Complete Toolkit Collection](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851)

## Generation Methodology

### Pipeline Overview

```
Orphanet Disease Profiles (Orphadata API)
    ↓
HPO Phenotype Enrichment (rd-phenotypes endpoint)
    ↓
Frequency-Weighted Symptom Sampling
    ↓
Difficulty Tier Assignment (easy / medium / hard)
    ↓
Clinical Vignette Generation
    ↓
Family History Generation
    ↓
12,984 Synthetic Patient Records
```

### Detailed Process

1. **Disease profile fetching**: ~4,500 rare disease profiles retrieved from the [Orphadata](https://www.orphadata.com/) rd-phenotypes endpoint, each with associated HPO (Human Phenotype Ontology) term annotations and frequency data
2. **Frequency-weighted HPO sampling**: For each synthetic patient, HPO terms are sampled based on their documented frequency in the disease profile (obligate > very frequent > frequent > occasional > very rare)
3. **Difficulty tiers**: Each patient is assigned one of three difficulty levels:
   - **Easy**: Core phenotypic features prominent, classic presentation
   - **Medium**: Mix of core and variable features, some atypical elements
   - **Hard**: Atypical presentation, overlapping phenotypes, diagnostic distractors
4. **Vignette generation**: Sampled HPO terms composed into naturalistic clinical narrative with age, sex, and presentation context
5. **Family history**: Generated to reflect inheritance patterns documented in the disease profile

### Data Quality Notes

- Vignettes are generated programmatically, not by an LLM, ensuring reproducibility
- Difficulty distribution is approximately balanced across tiers
- Each disease may have 1-10 synthetic patients depending on phenotype richness

## Ecosystem Context

Synthetic patients are the **foundation of the training flywheel**. Generated from Orphanet disease profiles across ~4,500 rare diseases, they provide the broad disease coverage that no single institution could collect on its own.

This dataset works in concert with the rest of the ecosystem:

- **Context Creators** (clinicians, patient advocates) contribute structured vignettes from [Undiagnosed Patient Hackathons](https://www.nature.com/articles/d41586-026-00302-8) that complement synthetic coverage with real diagnostic reasoning patterns
- **Validators** evaluate model outputs in the [ELO Arena](https://github.com/Wilhelm-Foundation/rare-archive/blob/main/ARCHITECTURE.md), and their corrections are exported as additional training data that augments this dataset
- **Model Builders** use this dataset plus Arena corrections to train [condition-specific adapters](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1#condition-specific-models) for disease clusters like IEM, Neuromuscular, and more

The correction-to-retrain cycle means this dataset grows smarter with every clinician interaction: corrections are merged with synthetic cases for the next training run.
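
The frequency-weighted HPO sampling step described under Generation Methodology can be illustrated with a small sketch. This is not the project's generator: the inclusion probabilities are assumed midpoints of the usual HPO frequency bands, `sample_phenotypes` is a hypothetical helper, and the profile uses placeholder term IDs. Each term is drawn independently, which mirrors the "independent HPO term sampling" noted among the known biases.

```python
import random

# Assumed inclusion probability per frequency label (illustrative midpoints,
# not values confirmed by the dataset authors).
FREQUENCY_WEIGHT = {
    "obligate": 1.0,
    "very_frequent": 0.9,
    "frequent": 0.55,
    "occasional": 0.17,
    "very_rare": 0.025,
}


def sample_phenotypes(profile, rng):
    """Split a disease profile's HPO terms into (present, absent) lists.

    `profile` maps an HPO term ID to a frequency label. Every term is drawn
    independently, so co-occurrence between symptoms is not modeled.
    """
    present, absent = [], []
    for term, label in profile.items():
        # rng.random() is in [0, 1), so obligate terms (weight 1.0) always pass.
        if rng.random() < FREQUENCY_WEIGHT[label]:
            present.append(term)
        else:
            absent.append(term)
    return present, absent


# Hypothetical disease profile with placeholder HPO term IDs.
example_profile = {
    "HP:0001250": "obligate",
    "HP:0001263": "very_frequent",
    "HP:0000252": "occasional",
    "HP:0000505": "very_rare",
}

# A seeded generator keeps the draw reproducible, in the spirit of the
# programmatic (non-LLM) generation described above.
present, absent = sample_phenotypes(example_profile, random.Random(0))
print("present:", present)
print("absent:", absent)
```

Terms that are not drawn end up in the absent list, which is one plausible way to populate a field such as `hpo_terms_absent`; how the actual pipeline chooses absent terms is not stated in this card.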

## Dataset Structure

### Data Fields

| Field | Type | Description |
|-------|------|-------------|
| `patient_id` | string | Unique synthetic patient identifier |
| `clinical_vignette` | string | Generated clinical presentation narrative |
| `ground_truth_diagnosis` | string | Disease name (ground truth label) |
| `disease_id` | string | Orphanet disease ID |
| `hpo_terms_present` | list[string] | HPO terms sampled as present in the patient |
| `hpo_terms_absent` | list[string] | HPO terms sampled as absent |
| `age` | int | Patient age |
| `sex` | string | Patient sex (M/F) |
| `difficulty` | string | Difficulty tier: easy, medium, or hard |
| `patient_category` | string | Patient category classification |
| `family_history` | string | Generated family history narrative |

### Data Splits

| Split | Records | Purpose |
|-------|---------|---------|
| train | 12,984 | Supervised fine-tuning (SFT) training data |

## Intended Use

**Primary use**: Stage 1 SFT (Supervised Fine-Tuning) training data for rare disease diagnostic models.

- Fine-tune language models to recognize rare disease presentations
- Augment real clinical case data (RareArena RDS/RDC) with broader disease coverage
- Pre-training exposure to rare disease phenotype patterns

**NOT intended for**: Clinical decision-making, patient diagnosis, or evaluation benchmarking (use [RDS](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds) or [RDC](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc) for evaluation).

**Part of the ecosystem flywheel**: Synthetic patients → model training → Arena evaluation → clinician corrections → merged back into training data → better models. Each cycle widens disease coverage and deepens diagnostic accuracy.
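
Before feeding records into a training run, it can help to check them against the field list in the Data Fields table. The sketch below is one possible check, not part of the dataset tooling: `validate_record` is a hypothetical helper, the sample record contains made-up placeholder values, and the rule that a term should not appear in both HPO lists is an assumption rather than a documented guarantee.

```python
# Expected Python type per field, following the Data Fields table.
EXPECTED_TYPES = {
    "patient_id": str,
    "clinical_vignette": str,
    "ground_truth_diagnosis": str,
    "disease_id": str,
    "hpo_terms_present": list,
    "hpo_terms_absent": list,
    "age": int,
    "sex": str,
    "difficulty": str,
    "patient_category": str,
    "family_history": str,
}
DIFFICULTY_TIERS = {"easy", "medium", "hard"}


def validate_record(record):
    """Return a list of problems found in `record`; an empty list means it passes."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    if record.get("difficulty") not in DIFFICULTY_TIERS:
        problems.append("difficulty: not one of easy/medium/hard")
    present = record.get("hpo_terms_present")
    absent = record.get("hpo_terms_absent")
    # Assumption: a term should not be listed as both present and absent.
    if isinstance(present, list) and isinstance(absent, list):
        if set(present) & set(absent):
            problems.append("HPO term listed as both present and absent")
    return problems


# Hypothetical record with placeholder values (not taken from the dataset).
sample = {
    "patient_id": "example-0001",
    "clinical_vignette": "Placeholder vignette text.",
    "ground_truth_diagnosis": "Placeholder disease",
    "disease_id": "ORPHA:0000",
    "hpo_terms_present": ["HP:0001250"],
    "hpo_terms_absent": ["HP:0000252"],
    "age": 7,
    "sex": "F",
    "difficulty": "medium",
    "patient_category": "placeholder",
    "family_history": "Placeholder family history.",
}

print(validate_record(sample))
print(validate_record({**sample, "difficulty": "extreme"}))
```

The same function can be passed row by row over a loaded split to collect records that need a closer look.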

## Bias, Risks & Limitations

### Known Biases

| Bias Category | Description | Impact |
|---------------|-------------|--------|
| **Orphanet coverage** | Limited to ~4,500 of 7,000+ known rare diseases | Diseases without Orphanet phenotype profiles are absent |
| **Formulaic structure** | Programmatic generation produces patterned vignettes | Models may learn structural cues rather than clinical reasoning |
| **Phenotypic simplification** | Independent HPO term sampling ignores complex co-occurrence patterns | Real patients exhibit correlated symptoms that this sampling misses |
| **Frequency bias** | High-frequency HPO terms dominate easy/medium cases | Under-exposure to rare phenotypic variants within diseases |
| **Age/sex distribution** | Generated to match disease epidemiology | May not reflect clinical presentation variation across demographics |
| **No lab data** | Vignettes contain symptoms and history only | No laboratory, imaging, or genetic testing results |

### Risks

- **Training on synthetic data only is insufficient**: models trained exclusively on synthetic vignettes will learn simplified patterns. Real clinical data (RDS/RDC) is essential for evaluation and supplementary training.
- **Vignette fidelity**: Synthetically generated presentations may not capture the diagnostic complexity of real patients
- **Ground truth quality**: Disease labels are inherited from Orphanet; misclassifications in source data propagate

### Limitations

- Monolingual (English only)
- No structured lab results (symptom-based vignettes only)
- No temporal progression: each vignette is a snapshot
- Difficulty tiers are heuristic, not clinically validated
- Some diseases have very few synthetic patients (1-2) due to sparse phenotype profiles

## Loading the Dataset

### Using HuggingFace Datasets

```python
from datasets import load_dataset

# Load the full dataset
dataset = load_dataset("Wilhelm-Foundation/rare-archive-synthetic-patients")

# Access the training split
train = dataset["train"]
print(f"Total patients: {len(train)}")

# Inspect a single patient
patient = train[0]
print(f"Diagnosis: {patient['ground_truth_diagnosis']}")
print(f"Difficulty: {patient['difficulty']}")
print(f"HPO terms present: {len(patient['hpo_terms_present'])}")
print(f"Vignette: {patient['clinical_vignette'][:200]}...")
```

### Using Pandas

```python
import pandas as pd

df = pd.read_parquet(
    "hf://datasets/Wilhelm-Foundation/rare-archive-synthetic-patients/data/train-00000-of-00001.parquet"
)
print(f"Shape: {df.shape}")
print(f"Unique diseases: {df['disease_id'].nunique()}")
print(f"Difficulty distribution:\n{df['difficulty'].value_counts()}")
```

### Filtering by Difficulty

```python
from datasets import load_dataset

dataset = load_dataset("Wilhelm-Foundation/rare-archive-synthetic-patients", split="train")

# Get only hard cases for challenging fine-tuning
hard_cases = dataset.filter(lambda x: x["difficulty"] == "hard")
print(f"Hard cases: {len(hard_cases)}")
```

### Preparing for SFT

```python
from datasets import load_dataset

dataset = load_dataset("Wilhelm-Foundation/rare-archive-synthetic-patients", split="train")

# Convert to chat format for SFT
def to_chat_format(example):
    return {
        "messages": [
            {"role": "system", "content": "You are a rare disease specialist. Given a clinical presentation, provide your differential diagnosis with reasoning."},
            {"role": "user", "content": example["clinical_vignette"]},
            {"role": "assistant", "content": f"Based on the clinical presentation, my primary diagnosis is {example['ground_truth_diagnosis']}."}
        ]
    }

sft_dataset = dataset.map(to_chat_format)
```

## Related Resources

| Resource | Link |
|----------|------|
| **Model** | [rare-archive-qwen-4b-sft-v1](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1): 4B SFT model trained using this data |
| **RDS Benchmark** | [rare-archive-eval-rarearena-rds](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds): 8,562 evaluation cases |
| **RDC Benchmark** | [rare-archive-eval-rarearena-rdc](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc): 4,376 cases with lab results |
| **Clinical Demo** | [rare-archive-clinical-demo](https://huggingface.co/spaces/Wilhelm-Foundation/rare-archive-clinical-demo): Interactive demo Space |
| **Collection** | [Rare AI Archive — Complete Toolkit](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851) |
| **GitHub** | [Wilhelm-Foundation/rare-archive](https://github.com/Wilhelm-Foundation/rare-archive) |

## Citation

```bibtex
@misc{rare-archive-synthetic-patients,
  title={Rare Archive Synthetic Patients: Frequency-Weighted HPO Sampling for Rare Disease SFT},
  author={Wilhelm Foundation},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients}
}
```

*A program of the [Wilhelm Foundation](https://wilhelm.foundation). Built on [Lattice Protocol](https://github.com/LatticeProtocol). No disease is too rare to matter.*
Provider:
Wilhelm-Foundation