Wilhelm-Foundation/rare-archive-synthetic-patients
Hugging Face · Updated 2026-03-26 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients
Resource description:
---
dataset_info:
  config_name: default
  features:
  - name: patient_id
    dtype: string
  - name: clinical_vignette
    dtype: string
  - name: ground_truth_diagnosis
    dtype: string
  - name: disease_id
    dtype: string
  - name: hpo_terms_present
    sequence: string
  - name: hpo_terms_absent
    sequence: string
  - name: age
    dtype: int64
  - name: sex
    dtype: string
  - name: difficulty
    dtype: string
  - name: patient_category
    dtype: string
  - name: family_history
    dtype: string
  splits:
  - name: train
    num_examples: 12984
tags:
- rare-disease
- clinical-diagnostics
- synthetic_patients
- rare-ai-archive
- training-data
- synthetic
task_categories:
- text-generation
- question-answering
language:
- en
size_categories:
- 10K<n<100K
license: cc-by-nc-sa-4.0
---
# Rare Archive Synthetic Patients — SFT Training Data
12,984 synthetic rare disease patient vignettes generated from Orphanet disease profiles. Designed for supervised fine-tuning (SFT) of diagnostic AI models. Part of the [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive).
> **All patients are computationally generated. Zero real patient data. Zero PHI.**
> This dataset contains no Protected Health Information. Every vignette is synthetically generated from public Orphanet disease profiles using frequency-weighted phenotype sampling. No real patients were involved in any stage of data creation.
> **Research use only.** This dataset is training data for AI research. It is NOT intended for clinical decision-making.
## Dataset Description
- **Repository:** [Wilhelm-Foundation/rare-archive-synthetic-patients](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients)
- **License:** CC BY-NC-SA 4.0
- **Version:** 0.1.0
- **Part of:** [Rare AI Archive](https://github.com/Wilhelm-Foundation/rare-archive) · [Complete Toolkit Collection](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851)
## Generation Methodology
### Pipeline Overview
```
Orphanet Disease Profiles (Orphadata API)
↓
HPO Phenotype Enrichment (rd-phenotypes endpoint)
↓
Frequency-Weighted Symptom Sampling
↓
Difficulty Tier Assignment (easy / medium / hard)
↓
Clinical Vignette Generation
↓
Family History Generation
↓
12,984 Synthetic Patient Records
```
### Detailed Process
1. **Disease profile fetching**: ~4,500 rare disease profiles retrieved from the [Orphadata](https://www.orphadata.com/) rd-phenotypes endpoint, each with associated HPO (Human Phenotype Ontology) term annotations and frequency data
2. **Frequency-weighted HPO sampling**: For each synthetic patient, HPO terms are sampled based on their documented frequency in the disease profile (obligate > very frequent > frequent > occasional > very rare)
3. **Difficulty tiers**: Each patient is assigned one of three difficulty levels:
   - **Easy**: Core phenotypic features prominent, classic presentation
   - **Medium**: Mix of core and variable features, some atypical elements
   - **Hard**: Atypical presentation, overlapping phenotypes, diagnostic distractors
4. **Vignette generation**: Sampled HPO terms are composed into a naturalistic clinical narrative with age, sex, and presentation context
5. **Family history**: Generated to reflect inheritance patterns documented in the disease profile
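
The frequency-weighted sampling in step 2 can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the per-band probabilities in `FREQUENCY_WEIGHTS`, the `sample_phenotypes` helper, and the toy profile are all assumptions made for the example, and the card does not state how absent terms are chosen (here, any term that is not drawn is treated as absent).

```python
import random

# Assumed inclusion probability for each HPO frequency band.
# These numbers are illustrative; the dataset card does not publish them.
FREQUENCY_WEIGHTS = {
    "obligate": 1.0,
    "very frequent": 0.9,
    "frequent": 0.55,
    "occasional": 0.17,
    "very rare": 0.02,
}


def sample_phenotypes(profile, rng):
    """Split a disease profile's HPO terms into (present, absent) lists.

    `profile` maps an HPO term ID to its frequency band. Each term is drawn
    independently with the probability assumed for its band.
    """
    present, absent = [], []
    for term, band in profile.items():
        if rng.random() < FREQUENCY_WEIGHTS[band]:
            present.append(term)
        else:
            absent.append(term)
    return present, absent


# Hypothetical disease profile; the term IDs are placeholders for illustration.
profile = {
    "HP:0001250": "obligate",
    "HP:0001263": "very frequent",
    "HP:0000252": "occasional",
    "HP:0000365": "very rare",
}

rng = random.Random(0)  # a fixed seed keeps the draw reproducible
present, absent = sample_phenotypes(profile, rng)
print("present:", present)
print("absent:", absent)
```

Drawing each term independently is the simplest reading of step 2. It is also the source of the "phenotypic simplification" bias, since correlated symptoms are never sampled jointly.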
### Data Quality Notes
- Vignettes are generated programmatically rather than by an LLM, which keeps generation reproducible
- Difficulty distribution is approximately balanced across tiers
- Each disease may have 1-10 synthetic patients depending on phenotype richness
## Ecosystem Context
Synthetic patients are the **foundation of the training flywheel**. Generated from Orphanet disease profiles across ~4,500 rare diseases, they provide the broad disease coverage that no single institution could collect on its own.
This dataset works in concert with the rest of the ecosystem:
- **Context Creators** (clinicians, patient advocates) contribute structured vignettes from [Undiagnosed Patient Hackathons](https://www.nature.com/articles/d41586-026-00302-8) that complement synthetic coverage with real diagnostic reasoning patterns
- **Validators** evaluate model outputs in the [ELO Arena](https://github.com/Wilhelm-Foundation/rare-archive/blob/main/ARCHITECTURE.md), and their corrections are exported as additional training data that augments this dataset
- **Model Builders** use this dataset plus Arena corrections to train [condition-specific adapters](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1#condition-specific-models) for disease clusters like IEM, Neuromuscular, and more
The correction-to-retrain cycle means the training corpus improves with each clinician interaction: corrections are merged with the synthetic cases for the next training run.
## Dataset Structure
### Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `patient_id` | string | Unique synthetic patient identifier |
| `clinical_vignette` | string | Generated clinical presentation narrative |
| `ground_truth_diagnosis` | string | Disease name (ground truth label) |
| `disease_id` | string | Orphanet disease ID |
| `hpo_terms_present` | list[string] | HPO terms sampled as present in the patient |
| `hpo_terms_absent` | list[string] | HPO terms sampled as absent |
| `age` | int | Patient age |
| `sex` | string | Patient sex (M/F) |
| `difficulty` | string | Difficulty tier: easy, medium, or hard |
| `patient_category` | string | Patient category classification |
| `family_history` | string | Generated family history narrative |
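
A quick way to check that loaded rows match the table above is a small validator. The sketch below is an assumption-laden illustration: `validate_record` is a hypothetical helper, and every value in `example` is invented (the real ID formats are not documented in this card).

```python
# Expected Python type for each field, taken from the Data Fields table.
EXPECTED_TYPES = {
    "patient_id": str,
    "clinical_vignette": str,
    "ground_truth_diagnosis": str,
    "disease_id": str,
    "hpo_terms_present": list,
    "hpo_terms_absent": list,
    "age": int,
    "sex": str,
    "difficulty": str,
    "patient_category": str,
    "family_history": str,
}
DIFFICULTY_TIERS = {"easy", "medium", "hard"}
SEX_VALUES = {"M", "F"}


def validate_record(record):
    """Return a list of problems; an empty list means the record matches the table."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    # The two HPO fields are lists of strings.
    for field in ("hpo_terms_present", "hpo_terms_absent"):
        values = record.get(field)
        if isinstance(values, list) and not all(isinstance(v, str) for v in values):
            problems.append(f"{field}: every entry should be a string")
    difficulty = record.get("difficulty")
    if isinstance(difficulty, str) and difficulty not in DIFFICULTY_TIERS:
        problems.append(f"difficulty: unexpected tier {difficulty!r}")
    sex = record.get("sex")
    if isinstance(sex, str) and sex not in SEX_VALUES:
        problems.append(f"sex: unexpected value {sex!r}")
    return problems


# Invented record for illustration only; not taken from the dataset.
example = {
    "patient_id": "SYN-000001",
    "clinical_vignette": "A 7-year-old girl presents with ...",
    "ground_truth_diagnosis": "Example disease",
    "disease_id": "ORPHA:0000",
    "hpo_terms_present": ["finding one", "finding two"],
    "hpo_terms_absent": ["finding three"],
    "age": 7,
    "sex": "F",
    "difficulty": "easy",
    "patient_category": "example",
    "family_history": "No relevant family history.",
}
print(validate_record(example))
```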
### Data Splits
| Split | Records | Purpose |
|-------|---------|---------|
| train | 12,984 | Supervised fine-tuning (SFT) training data |
## Intended Use
**Primary use**: Stage 1 SFT (Supervised Fine-Tuning) training data for rare disease diagnostic models.
- Fine-tune language models to recognize rare disease presentations
- Augment real clinical case data (RareArena RDS/RDC) with broader disease coverage
- Provide pre-training exposure to rare disease phenotype patterns
**NOT intended for**: Clinical decision-making, patient diagnosis, or evaluation benchmarking (use [RDS](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds) or [RDC](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc) for evaluation).
**Part of the ecosystem flywheel**: Synthetic patients → model training → Arena evaluation → clinician corrections → merged back into training data → better models. Each cycle widens disease coverage and deepens diagnostic accuracy.
## Bias, Risks & Limitations
### Known Biases
| Bias Category | Description | Impact |
|---------------|-------------|--------|
| **Orphanet coverage** | Limited to ~4,500 of 7,000+ known rare diseases | Diseases without Orphanet phenotype profiles are absent |
| **Formulaic structure** | Programmatic generation produces patterned vignettes | Models may learn structural cues rather than clinical reasoning |
| **Phenotypic simplification** | Independent HPO term sampling ignores complex co-occurrence patterns | Real patients exhibit correlated symptoms that this sampling misses |
| **Frequency bias** | High-frequency HPO terms dominate easy/medium cases | Under-exposure to rare phenotypic variants within diseases |
| **Age/sex distribution** | Generated to match disease epidemiology | May not reflect clinical presentation variation across demographics |
| **No lab data** | Vignettes contain symptoms and history only | No laboratory, imaging, or genetic testing results |
### Risks
- **Training on synthetic data only is insufficient** — models trained exclusively on synthetic vignettes will learn simplified patterns. Real clinical data (RDS/RDC) is essential for evaluation and supplementary training.
- **Vignette fidelity**: Synthetically generated presentations may not capture the diagnostic complexity of real patients
- **Ground truth quality**: Disease labels are inherited from Orphanet; misclassifications in source data propagate
### Limitations
- Monolingual (English only)
- No structured lab results (symptom-based vignettes only)
- No temporal progression — each vignette is a snapshot
- Difficulty tiers are heuristic, not clinically validated
- Some diseases have very few synthetic patients (1-2) due to sparse phenotype profiles
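
The last limitation can be checked before training by counting patients per `disease_id`. The sketch below assumes rows are plain dicts carrying a `disease_id` key; `sparse_diseases`, its `max_patients` threshold, and the toy rows are hypothetical.

```python
from collections import Counter


def sparse_diseases(records, max_patients=2):
    """Return the sorted disease IDs that have at most `max_patients` rows."""
    counts = Counter(row["disease_id"] for row in records)
    return sorted(d for d, n in counts.items() if n <= max_patients)


# Toy rows with placeholder disease IDs, for illustration only.
rows = (
    [{"disease_id": "ORPHA:A"}] * 3
    + [{"disease_id": "ORPHA:B"}] * 2
    + [{"disease_id": "ORPHA:C"}]
)
print(sparse_diseases(rows))
```

Because the function only iterates over rows, it should also accept a loaded `train` split directly, since iterating a `datasets.Dataset` yields one dict per row.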
## Loading the Dataset
### Using Hugging Face Datasets
```python
from datasets import load_dataset
# Load the full dataset
dataset = load_dataset("Wilhelm-Foundation/rare-archive-synthetic-patients")
# Access the training split
train = dataset["train"]
print(f"Total patients: {len(train)}")
# Inspect a single patient
patient = train[0]
print(f"Diagnosis: {patient['ground_truth_diagnosis']}")
print(f"Difficulty: {patient['difficulty']}")
print(f"HPO terms present: {len(patient['hpo_terms_present'])}")
print(f"Vignette: {patient['clinical_vignette'][:200]}...")
```
### Using Pandas
```python
import pandas as pd
df = pd.read_parquet(
    "hf://datasets/Wilhelm-Foundation/rare-archive-synthetic-patients/data/train-00000-of-00001.parquet"
)
print(f"Shape: {df.shape}")
print(f"Unique diseases: {df['disease_id'].nunique()}")
print(f"Difficulty distribution:\n{df['difficulty'].value_counts()}")
```
### Filtering by Difficulty
```python
from datasets import load_dataset
dataset = load_dataset("Wilhelm-Foundation/rare-archive-synthetic-patients", split="train")
# Get only hard cases for challenging fine-tuning
hard_cases = dataset.filter(lambda x: x["difficulty"] == "hard")
print(f"Hard cases: {len(hard_cases)}")
```
### Preparing for SFT
```python
from datasets import load_dataset
dataset = load_dataset("Wilhelm-Foundation/rare-archive-synthetic-patients", split="train")
# Convert to chat format for SFT
def to_chat_format(example):
    # One vignette becomes a three-turn conversation: system, user, assistant
    return {
        "messages": [
            {"role": "system", "content": "You are a rare disease specialist. Given a clinical presentation, provide your differential diagnosis with reasoning."},
            {"role": "user", "content": example["clinical_vignette"]},
            {"role": "assistant", "content": f"Based on the clinical presentation, my primary diagnosis is {example['ground_truth_diagnosis']}."},
        ]
    }
sft_dataset = dataset.map(to_chat_format)
```
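
The system prompt above asks for reasoning, while the assistant turn states only the diagnosis. One possible variant, sketched here under assumptions, appends a few entries from `hpo_terms_present` to the target. The `to_chat_with_findings` helper, the `max_findings` cap, and the "Supporting findings" wording are hypothetical, and the card does not say whether `hpo_terms_present` holds HPO IDs or readable labels, so inspect a few rows first.

```python
SYSTEM_PROMPT = (
    "You are a rare disease specialist. Given a clinical presentation, "
    "provide your differential diagnosis with reasoning."
)


def to_chat_with_findings(example, max_findings=5):
    """Build a chat example whose answer also lists some sampled findings."""
    findings = example["hpo_terms_present"][:max_findings]
    answer = (
        "Based on the clinical presentation, my primary diagnosis is "
        f"{example['ground_truth_diagnosis']}."
    )
    if findings:
        answer += " Supporting findings: " + ", ".join(findings) + "."
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["clinical_vignette"]},
            {"role": "assistant", "content": answer},
        ]
    }


# Invented row for illustration only; not taken from the dataset.
example = {
    "clinical_vignette": "A 7-year-old girl presents with ...",
    "ground_truth_diagnosis": "Example disease",
    "hpo_terms_present": ["finding one", "finding two"],
}
print(to_chat_with_findings(example)["messages"][2]["content"])
```

With the dataset loaded, this could be applied as `dataset.map(to_chat_with_findings, remove_columns=dataset.column_names)` so that only the `messages` column remains.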
## Related Resources
| Resource | Link |
|----------|------|
| **Model** | [rare-archive-qwen-4b-sft-v1](https://huggingface.co/Wilhelm-Foundation/rare-archive-qwen-4b-sft-v1) — 4B SFT model trained using this data |
| **RDS Benchmark** | [rare-archive-eval-rarearena-rds](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rds) — 8,562 evaluation cases |
| **RDC Benchmark** | [rare-archive-eval-rarearena-rdc](https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-eval-rarearena-rdc) — 4,376 cases with lab results |
| **Clinical Demo** | [rare-archive-clinical-demo](https://huggingface.co/spaces/Wilhelm-Foundation/rare-archive-clinical-demo) — Interactive demo Space |
| **Collection** | [Rare AI Archive — Complete Toolkit](https://huggingface.co/collections/Wilhelm-Foundation/rare-ai-archive-complete-toolkit-69c4b1e14800a370fe028851) |
| **GitHub** | [Wilhelm-Foundation/rare-archive](https://github.com/Wilhelm-Foundation/rare-archive) |
## Citation
```bibtex
@misc{rare-archive-synthetic-patients,
title={Rare Archive Synthetic Patients: Frequency-Weighted HPO Sampling for Rare Disease SFT},
author={Wilhelm Foundation},
year={2026},
publisher={HuggingFace},
url={https://huggingface.co/datasets/Wilhelm-Foundation/rare-archive-synthetic-patients}
}
```
*A program of the [Wilhelm Foundation](https://wilhelm.foundation). Built on [Lattice Protocol](https://github.com/LatticeProtocol). No disease is too rare to matter.*
Provided by:
Wilhelm-Foundation



