samwell/synthea-ncd-instructions
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/samwell/synthea-ncd-instructions
---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- medical
- healthcare
- ncd
- diabetes
- hypertension
- clinical
- ehr
- synthetic
- instruction-tuning
- fine-tuning
- gemma
- llm
size_categories:
- 10K<n<100K
pretty_name: Synthea NCD Risk Assessment Instructions
---
# Synthea NCD Instructions
Synthetic EHR-based instruction-tuning dataset for training LLMs to predict non-communicable disease (NCD) risk, specifically **Type 2 Diabetes** and **Hypertension**.
## Quick Start
```python
from datasets import load_dataset
dataset = load_dataset("samwell/synthea-ncd-instructions")
# View a sample
print(dataset["train"][0])
```
## Dataset Description
This dataset contains instruction-tuning examples derived from synthetic patient records generated using [Synthea](https://github.com/synthetichealth/synthea). Each example presents a patient's clinical data and asks the model to assess their NCD risk.
### Why This Dataset?
- **NCD burden**: Diabetes and hypertension together affect more than a billion people worldwide, with a disproportionate burden in low- and middle-income countries (LMICs)
- **Clinical decision support**: LLMs can help with risk stratification at scale
- **Open & reproducible**: Fully synthetic (contains no real patient data), Apache 2.0 licensed
- **Instruction-tuned format**: Ready for fine-tuning Gemma, Llama, Mistral, etc.
### Intended Use
- Fine-tuning LLMs for clinical decision support
- Training models for NCD risk stratification
- Research in medical AI and clinical NLP
- Educational purposes
- Prototyping healthcare AI applications
### Not Intended For
- Direct clinical use without validation
- Replacing clinical judgment
- Diagnosis or treatment decisions
- Production deployment without clinician review
## Dataset Structure
Each example contains:
| Field | Description |
|-------|-------------|
| `instruction` | Task description asking for NCD risk assessment |
| `input` | Patient record with demographics, vitals, labs, conditions, medications |
| `output` | Structured risk assessment with risk levels and recommendations |
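The Generation Process section describes the records as Alpaca-style, so one way to turn the three fields into training text is the widely used Alpaca prompt layout. The sketch below assumes that layout; the template string and the helper names `build_prompt` and `build_training_text` are illustrative and are not taken from the dataset's generation scripts.

```python
# Widely used Alpaca prompt layout. This is an assumption: the card only says
# "Alpaca-style" and does not publish the exact template.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(example: dict) -> str:
    """Render the prompt part of one record (everything before the answer)."""
    return ALPACA_TEMPLATE.format(
        instruction=example["instruction"], input=example["input"]
    )


def build_training_text(example: dict, eos_token: str = "") -> str:
    """Prompt followed by the target output, for supervised fine-tuning."""
    return build_prompt(example) + example["output"] + eos_token


# Shortened, hand-written record using the three documented fields.
example = {
    "instruction": "Assess the risk of Type 2 diabetes and hypertension.",
    "input": "Patient: 52yo Female\nVitals: BP 142/88 mmHg, BMI 29.4",
    "output": "**Hypertension: HIGH**",
}
print(build_training_text(example))
```

When fine-tuning, `eos_token` would normally be set to the tokenizer's end-of-sequence token so the model learns where an assessment stops.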
### Example
```json
{
  "instruction": "Based on the following patient record, assess the risk of Type 2 diabetes and hypertension. Provide risk levels (LOW, MODERATE, HIGH, or DIAGNOSED) with supporting factors, and clinical recommendations.",
  "input": "Patient: 52yo Female\nVitals: BP 142/88 mmHg, BMI 29.4, Weight 78.2 kg\nLabs: Glucose 118 mg/dL, HbA1c 6.0%, Total Cholesterol 210 mg/dL\nActive conditions: Prediabetes\nMedications: None",
  "output": "## Risk Assessment\n\n**Type 2 Diabetes: MODERATE**\n- Fasting glucose 118 mg/dL (prediabetes range 100-125)\n- HbA1c 6.0% (prediabetes range 5.7-6.4%)\n- BMI 29.4 (overweight)\n- History of prediabetes\n\n**Hypertension: HIGH**\n- BP 142/88 mmHg (Stage 1 hypertension)\n\n## Recommendations\n1. Lifestyle counseling: diet modification, increase physical activity\n2. Recheck glucose/HbA1c in 3-6 months\n3. Confirm elevated BP on 2 separate occasions\n4. Consider initiating antihypertensive therapy"
}
```
### Risk Levels
| Level | Description |
|-------|-------------|
| **LOW** | No significant risk factors identified |
| **MODERATE** | Some risk factors present, lifestyle modification recommended |
| **HIGH** | Multiple risk factors or abnormal values, further workup needed |
| **DIAGNOSED** | Patient has confirmed diagnosis in their record |
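Because the risk level appears in the `output` text as a bold `Condition: LEVEL` line, it can be pulled out with a regular expression when scoring model predictions. This is a minimal sketch that assumes every record follows the layout of the example above; `RISK_LINE` and `parse_risk_levels` are hypothetical names.

```python
import re

# Matches lines such as "**Type 2 Diabetes: MODERATE**".
# Assumes the bold "Condition: LEVEL" layout seen in the card's example.
RISK_LINE = re.compile(
    r"\*\*(?P<condition>[^:*]+):\s*(?P<level>LOW|MODERATE|HIGH|DIAGNOSED)\*\*"
)


def parse_risk_levels(output: str) -> dict:
    """Map each condition named in an assessment to its risk level."""
    return {
        match.group("condition").strip(): match.group("level")
        for match in RISK_LINE.finditer(output)
    }


sample_output = (
    "## Risk Assessment\n\n"
    "**Type 2 Diabetes: MODERATE**\n"
    "- HbA1c 6.0% (prediabetes range 5.7-6.4%)\n\n"
    "**Hypertension: HIGH**\n"
    "- BP 142/88 mmHg\n"
)
print(parse_risk_levels(sample_output))
# {'Type 2 Diabetes': 'MODERATE', 'Hypertension': 'HIGH'}
```

Comparing the parsed levels of a generated answer with those of the reference `output` gives a simple per-condition accuracy measure.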
## Data Splits
| Split | Examples | Purpose |
|-------|----------|---------|
| train | ~40,000 | Model training |
| val | ~5,000 | Hyperparameter tuning |
| test | ~5,000 | Final evaluation |
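The card does not say how the splits were drawn. For anyone regenerating the data, a seeded shuffle over patient identifiers is one plausible way to obtain the same 80/10/10 proportions reproducibly. The function name, the seed, and the ID format below are all assumptions.

```python
import random


def split_ids(ids, seed=42, fractions=(0.8, 0.1, 0.1)):
    """Shuffle IDs with a fixed seed and cut them into train/val/test."""
    shuffled = sorted(ids)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to test
    }


# Hypothetical patient identifiers, for illustration only.
patient_ids = [f"patient-{i:04d}" for i in range(1000)]
splits = split_ids(patient_ids)
print({name: len(members) for name, members in splits.items()})
```

Splitting by patient rather than by example keeps every record of one patient in a single split, which avoids leakage if a patient contributes more than one example.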
## Clinical Parameters
### Observations Used
**Vitals:**
- Blood Pressure (systolic/diastolic)
- BMI
- Body Weight
- Heart Rate
**Labs:**
- Fasting Glucose
- HbA1c
- Total Cholesterol, HDL, LDL, Triglycerides
- Creatinine, eGFR
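For analysis it can help to turn the free-text `input` field back into numbers. The sketch below covers only the observations that appear in the card's example record and assumes they are always written in that form; the function name and the dictionary keys are hypothetical.

```python
import re


def parse_patient_record(text: str) -> dict:
    """Extract a few numeric observations from an `input` string."""
    record = {}
    if m := re.search(r"Patient:\s*(\d+)yo\s+(\w+)", text):
        record["age"] = int(m.group(1))
        record["sex"] = m.group(2)
    if m := re.search(r"BP\s+(\d+)/(\d+)\s*mmHg", text):
        record["systolic"] = int(m.group(1))
        record["diastolic"] = int(m.group(2))
    # Label/unit patterns as written in the card's example record.
    numeric_patterns = {
        "bmi": r"BMI\s+([\d.]+)",
        "weight_kg": r"Weight\s+([\d.]+)\s*kg",
        "glucose_mg_dl": r"Glucose\s+([\d.]+)\s*mg/dL",
        "hba1c_pct": r"HbA1c\s+([\d.]+)\s*%",
        "total_cholesterol_mg_dl": r"Total Cholesterol\s+([\d.]+)\s*mg/dL",
    }
    for key, pattern in numeric_patterns.items():
        if m := re.search(pattern, text):
            record[key] = float(m.group(1))
    return record


sample_input = (
    "Patient: 52yo Female\n"
    "Vitals: BP 142/88 mmHg, BMI 29.4, Weight 78.2 kg\n"
    "Labs: Glucose 118 mg/dL, HbA1c 6.0%, Total Cholesterol 210 mg/dL\n"
    "Active conditions: Prediabetes\n"
    "Medications: None"
)
print(parse_patient_record(sample_input))
```

Observations that are missing from a record are simply absent from the returned dictionary, so callers should use `.get()` rather than direct indexing.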
### Conditions Tracked
| Condition | SNOMED Code |
|-----------|-------------|
| Prediabetes | 714628002 |
| Type 2 Diabetes | 44054006 |
| Essential Hypertension | 59621000 |
| Diabetic Neuropathy | 368581000119106 |
| Diabetic Retinopathy | 1551000119108 |
| Diabetic Kidney Disease | 127013003 |
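The codes in this table can be used to pick out the tracked conditions from Synthea's CSV export. The sketch below assumes the conditions file has `PATIENT` and `CODE` columns, as in Synthea's standard CSV exporter; the sample rows are made up for illustration, and the function name is hypothetical.

```python
import csv
import io

# SNOMED codes from the table above, kept as strings to match CSV cells.
NCD_SNOMED_CODES = {
    "714628002": "Prediabetes",
    "44054006": "Type 2 Diabetes",
    "59621000": "Essential Hypertension",
    "368581000119106": "Diabetic Neuropathy",
    "1551000119108": "Diabetic Retinopathy",
    "127013003": "Diabetic Kidney Disease",
}


def tracked_conditions_by_patient(csv_file) -> dict:
    """Group tracked NCD conditions by patient ID, ignoring other codes."""
    by_patient = {}
    for row in csv.DictReader(csv_file):
        label = NCD_SNOMED_CODES.get(row["CODE"])
        if label is not None:
            by_patient.setdefault(row["PATIENT"], set()).add(label)
    return by_patient


# Made-up rows in the assumed column layout; not real dataset content.
sample_csv = io.StringIO(
    "START,STOP,PATIENT,ENCOUNTER,CODE,DESCRIPTION\n"
    "2019-03-01,,p1,e1,714628002,Prediabetes\n"
    "2020-06-15,,p1,e2,59621000,Essential hypertension (disorder)\n"
    "2021-01-10,2021-01-20,p2,e3,195662009,Acute viral pharyngitis (disorder)\n"
)
print(tracked_conditions_by_patient(sample_csv))
```

To run it on a real export, pass an open file handle instead of the in-memory sample, for example `open("conditions.csv", newline="")`.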
## Generation Process
1. **Synthea** generated synthetic patient populations with realistic disease progression
2. Patient records were filtered for those with relevant NCD observations
3. Risk assessments were generated using clinical guidelines:
   - ADA criteria for diabetes/prediabetes
   - ACC/AHA guidelines for hypertension
4. Data was formatted for instruction-tuning (Alpaca-style)
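To make step 3 more concrete, the sketch below encodes the ADA glycemic cut-offs that also appear in the example record (fasting glucose 100-125 mg/dL and HbA1c 5.7-6.4% for prediabetes, with values at or above 126 mg/dL or 6.5% in the diabetes range). It is not the project's actual script: how the generator maps these categories to LOW/MODERATE/HIGH is not documented in this card, so the functions return only the guideline category. Blood pressure staging is left out because the card does not state the exact cut-offs applied. All function names are hypothetical.

```python
from typing import Optional

# Ordered from least to most severe.
CATEGORIES = ["normal", "prediabetes", "diabetes"]


def fasting_glucose_category(mg_dl: float) -> str:
    """ADA fasting plasma glucose ranges (mg/dL)."""
    if mg_dl >= 126:
        return "diabetes"
    if mg_dl >= 100:
        return "prediabetes"
    return "normal"


def hba1c_category(percent: float) -> str:
    """ADA HbA1c ranges (%)."""
    if percent >= 6.5:
        return "diabetes"
    if percent >= 5.7:
        return "prediabetes"
    return "normal"


def glycemic_category(
    glucose_mg_dl: Optional[float] = None, hba1c_pct: Optional[float] = None
) -> Optional[str]:
    """Most severe category among the available labs, or None if no labs.

    A single value in the diabetes range is not a diagnosis on its own;
    clinical practice normally calls for a confirmatory test.
    """
    found = []
    if glucose_mg_dl is not None:
        found.append(fasting_glucose_category(glucose_mg_dl))
    if hba1c_pct is not None:
        found.append(hba1c_category(hba1c_pct))
    return max(found, key=CATEGORIES.index) if found else None


# Values from the card's example record: glucose 118 mg/dL, HbA1c 6.0%.
print(glycemic_category(glucose_mg_dl=118, hba1c_pct=6.0))  # prediabetes
```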
### Reproducibility
The generation scripts are available at: [github.com/HopeOS/training](https://github.com/hopeos/training)
```bash
# Generate synthetic patients
./run_synthea -p 50000 --exporter.csv.export=true
# Transform to instruction format
python synthea_to_instructions.py --input ./synthea/output/csv --output ./data
```
## Limitations
- **Synthetic data**: Does not capture all real-world clinical complexity
- **US-based demographics**: Synthea defaults to US population characteristics
- **Simplified risk model**: Does not include family history, lifestyle factors, or genetic risk
- **English only**: All text is in English
- **No longitudinal reasoning**: Each example is a snapshot, not a time-series
## Changelog
### v1.0.0 (April 2026)
- Initial release
- ~50,000 synthetic patients
- Diabetes and hypertension risk assessment
- Train/val/test splits (80/10/10)
## Roadmap
Planned improvements (contributions welcome!):
- [ ] **Ghana/African demographics**: Custom Synthea config for African population characteristics
- [ ] **Additional NCDs**: Chronic kidney disease, cardiovascular disease, obesity
- [ ] **Multilingual**: French, Twi, Hausa translations for West African context
- [ ] **Longitudinal examples**: Multi-visit patient trajectories
- [ ] **Family history**: Incorporate genetic risk factors
- [ ] **Lifestyle factors**: Diet, exercise, smoking, alcohol
- [ ] **Validated models**: Release fine-tuned Gemma/Llama checkpoints
## Contributing
We welcome contributions! Here's how you can help:
1. **Report issues**: Found an error in the data? [Open an issue](https://huggingface.co/datasets/samwell/synthea-ncd-instructions/discussions)
2. **Improve generation**: Submit PRs to the generation scripts
3. **Add demographics**: Help create Synthea configs for other regions
4. **Validate clinically**: Are you a clinician? Help us review the risk assessments
5. **Translate**: Help translate to other languages
## Citation
```bibtex
@dataset{synthea_ncd_instructions_2026,
title={Synthea NCD Instructions: A Synthetic Dataset for Clinical Risk Assessment},
author={samwell},
year={2026},
publisher={Hugging Face},
url={https://huggingface.co/datasets/samwell/synthea-ncd-instructions},
note={Living dataset - check for updates}
}
```
## License
Apache 2.0 - free to use, modify, and distribute with attribution.
## Acknowledgments
- [Synthea](https://github.com/synthetichealth/synthea) - Synthetic patient generation
- [Unsloth](https://github.com/unslothai/unsloth) - Efficient fine-tuning
- Clinical guidelines: ADA, ACC/AHA, WHO
- HopeOS team for the initial implementation
## Contact
- **Maintainer**: [@samwell](https://huggingface.co/samwell)
- **Discussions**: [Dataset discussions](https://huggingface.co/datasets/samwell/synthea-ncd-instructions/discussions)
- **Issues**: Report data quality issues in discussions
---
*Last updated: April 2026*
Provider: samwell



