
samwell/synthea-ncd-instructions

Source: Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/samwell/synthea-ncd-instructions
Resource description:

---
license: apache-2.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- medical
- healthcare
- ncd
- diabetes
- hypertension
- clinical
- ehr
- synthetic
- instruction-tuning
- fine-tuning
- gemma
- llm
size_categories:
- 10K<n<100K
pretty_name: Synthea NCD Risk Assessment Instructions
---

# Synthea NCD Instructions

Synthetic EHR-based instruction-tuning dataset for training LLMs to predict non-communicable disease (NCD) risk, specifically **Type 2 Diabetes** and **Hypertension**.

## Quick Start

```python
from datasets import load_dataset

dataset = load_dataset("samwell/synthea-ncd-instructions")

# View a sample
print(dataset["train"][0])
```

## Dataset Description

This dataset contains instruction-tuning examples derived from synthetic patient records generated using [Synthea](https://github.com/synthetichealth/synthea). Each example presents a patient's clinical data and asks the model to assess their NCD risk.

### Why This Dataset?

- **NCD burden**: Diabetes and hypertension affect billions globally, especially in LMICs
- **Clinical decision support**: LLMs can help with risk stratification at scale
- **Open & reproducible**: Fully synthetic, no privacy concerns, Apache 2.0 licensed
- **Instruction-tuned format**: Ready for fine-tuning Gemma, Llama, Mistral, etc.
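
The instruction-tuned format mentioned above can be flattened into a single training string along the following lines. This is a minimal sketch, assuming the Alpaca-style record fields `instruction`, `input`, and `output`; the template wording and the helper name `format_alpaca_prompt` are illustrative assumptions and are not taken from the dataset's generation scripts. The sample record is abbreviated and hand-written.

```python
# Hypothetical prompt template: the exact wording is an assumption,
# not the template used by the dataset authors.
ALPACA_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def format_alpaca_prompt(example: dict) -> str:
    """Join the instruction, input, and output fields into one training string."""
    return ALPACA_TEMPLATE.format(
        instruction=example["instruction"].strip(),
        input=example["input"].strip(),
        output=example["output"].strip(),
    )


# Abbreviated, hand-written record for illustration only.
sample = {
    "instruction": "Assess the risk of Type 2 diabetes and hypertension.",
    "input": "Patient: 52yo Female\nVitals: BP 142/88 mmHg, BMI 29.4",
    "output": "**Type 2 Diabetes: MODERATE**\n**Hypertension: HIGH**",
}

prompt = format_alpaca_prompt(sample)
print(prompt)
```

With the dataset loaded as in Quick Start, the same helper could be applied per record, for example `dataset["train"].map(lambda ex: {"text": format_alpaca_prompt(ex)})`. Model-specific chat templates (Gemma, Llama, Mistral) differ, so the template string would likely need adapting.
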

### Intended Use

- Fine-tuning LLMs for clinical decision support
- Training models for NCD risk stratification
- Research in medical AI and clinical NLP
- Educational purposes
- Prototyping healthcare AI applications

### Not Intended For

- Direct clinical use without validation
- Replacing clinical judgment
- Diagnosis or treatment decisions
- Production deployment without clinician review

## Dataset Structure

Each example contains:

| Field | Description |
|-------|-------------|
| `instruction` | Task description asking for NCD risk assessment |
| `input` | Patient record with demographics, vitals, labs, conditions, medications |
| `output` | Structured risk assessment with risk levels and recommendations |

### Example

```json
{
  "instruction": "Based on the following patient record, assess the risk of Type 2 diabetes and hypertension. Provide risk levels (LOW, MODERATE, HIGH, or DIAGNOSED) with supporting factors, and clinical recommendations.",
  "input": "Patient: 52yo Female\nVitals: BP 142/88 mmHg, BMI 29.4, Weight 78.2 kg\nLabs: Glucose 118 mg/dL, HbA1c 6.0%, Total Cholesterol 210 mg/dL\nActive conditions: Prediabetes\nMedications: None",
  "output": "## Risk Assessment\n\n**Type 2 Diabetes: MODERATE**\n- Fasting glucose 118 mg/dL (prediabetes range 100-125)\n- HbA1c 6.0% (prediabetes range 5.7-6.4%)\n- BMI 29.4 (overweight)\n- History of prediabetes\n\n**Hypertension: HIGH**\n- BP 142/88 mmHg (Stage 1 hypertension)\n\n## Recommendations\n1. Lifestyle counseling: diet modification, increase physical activity\n2. Recheck glucose/HbA1c in 3-6 months\n3. Confirm elevated BP on 2 separate occasions\n4. Consider initiating antihypertensive therapy"
}
```

### Risk Levels

| Level | Description |
|-------|-------------|
| **LOW** | No significant risk factors identified |
| **MODERATE** | Some risk factors present, lifestyle modification recommended |
| **HIGH** | Multiple risk factors or abnormal values, further workup needed |
| **DIAGNOSED** | Patient has confirmed diagnosis in their record |

## Data Splits

| Split | Examples | Purpose |
|-------|----------|---------|
| train | ~40,000 | Model training |
| val | ~5,000 | Hyperparameter tuning |
| test | ~5,000 | Final evaluation |

## Clinical Parameters

### Observations Used

**Vitals:**

- Blood Pressure (systolic/diastolic)
- BMI
- Body Weight
- Heart Rate

**Labs:**

- Fasting Glucose
- HbA1c
- Total Cholesterol, HDL, LDL, Triglycerides
- Creatinine, eGFR

### Conditions Tracked

| Condition | SNOMED Code |
|-----------|-------------|
| Prediabetes | 714628002 |
| Type 2 Diabetes | 44054006 |
| Essential Hypertension | 59621000 |
| Diabetic Neuropathy | 368581000119106 |
| Diabetic Retinopathy | 1551000119108 |
| Diabetic Kidney Disease | 127013003 |

## Generation Process

1. **Synthea** generated synthetic patient populations with realistic disease progression
2. Patient records were filtered for those with relevant NCD observations
3. Risk assessments were generated using clinical guidelines:
   - ADA criteria for diabetes/prediabetes
   - ACC/AHA guidelines for hypertension
4. Data was formatted for instruction-tuning (Alpaca-style)

### Reproducibility

The generation scripts are available at: [github.com/HopeOS/training](https://github.com/hopeos/training)

```bash
# Generate synthetic patients
./run_synthea -p 50000 --exporter.csv.export=true

# Transform to instruction format
python synthea_to_instructions.py --input ./synthea/output/csv --output ./data
```

## Limitations

- **Synthetic data**: Does not capture all real-world clinical complexity
- **US-based demographics**: Synthea defaults to US population characteristics
- **Simplified risk model**: Does not include family history, lifestyle factors, or genetic risk
- **English only**: All text is in English
- **No longitudinal reasoning**: Each example is a snapshot, not a time-series

## Changelog

### v1.0.0 (April 2026)

- Initial release
- ~50,000 synthetic patients
- Diabetes and hypertension risk assessment
- Train/val/test splits (80/10/10)

## Roadmap

Planned improvements (contributions welcome!):

- [ ] **Ghana/African demographics**: Custom Synthea config for African population characteristics
- [ ] **Additional NCDs**: Chronic kidney disease, cardiovascular disease, obesity
- [ ] **Multilingual**: French, Twi, Hausa translations for West African context
- [ ] **Longitudinal examples**: Multi-visit patient trajectories
- [ ] **Family history**: Incorporate genetic risk factors
- [ ] **Lifestyle factors**: Diet, exercise, smoking, alcohol
- [ ] **Validated models**: Release fine-tuned Gemma/Llama checkpoints

## Contributing

We welcome contributions! Here's how you can help:

1. **Report issues**: Found an error in the data? [Open an issue](https://huggingface.co/datasets/samwell/synthea-ncd-instructions/discussions)
2. **Improve generation**: Submit PRs to the generation scripts
3. **Add demographics**: Help create Synthea configs for other regions
4. **Validate clinically**: Are you a clinician? Help us review the risk assessments
5. **Translate**: Help translate to other languages

## Citation

```bibtex
@dataset{synthea_ncd_instructions_2026,
  title={Synthea NCD Instructions: A Synthetic Dataset for Clinical Risk Assessment},
  author={samwell},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/samwell/synthea-ncd-instructions},
  note={Living dataset - check for updates}
}
```

## License

Apache 2.0 - free to use, modify, and distribute with attribution.

## Acknowledgments

- [Synthea](https://github.com/synthetichealth/synthea) - Synthetic patient generation
- [Unsloth](https://github.com/unslothai/unsloth) - Efficient fine-tuning
- Clinical guidelines: ADA, ACC/AHA, WHO
- HopeOS team for the initial implementation

## Contact

- **Maintainer**: [@samwell](https://huggingface.co/samwell)
- **Discussions**: [Dataset discussions](https://huggingface.co/datasets/samwell/synthea-ncd-instructions/discussions)
- **Issues**: Report data quality issues in discussions

---

*Last updated: April 2026*
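
When scoring a fine-tuned model on the test split, the per-condition risk levels can be pulled out of the `output` text and compared with the reference. The following is a minimal sketch, assuming every output marks its levels as `**<condition>: <LEVEL>**` the way the card's example record does; that pattern has not been verified across the whole dataset, and the helper names `parse_risk_levels` and `risk_levels_match` are hypothetical.

```python
import re

# Risk levels listed in the dataset card.
RISK_LEVELS = ("LOW", "MODERATE", "HIGH", "DIAGNOSED")

# Matches bold lines such as "**Type 2 Diabetes: MODERATE**".
# Assumption: all outputs follow the pattern seen in the card's example.
_RISK_LINE = re.compile(
    r"\*\*(?P<condition>[^*:\n]+):\s*(?P<level>"
    + "|".join(RISK_LEVELS)
    + r")\*\*"
)


def parse_risk_levels(output_text: str) -> dict:
    """Return a {condition: level} mapping found in a risk-assessment text."""
    return {
        m.group("condition").strip(): m.group("level")
        for m in _RISK_LINE.finditer(output_text)
    }


def risk_levels_match(predicted: str, reference: str) -> bool:
    """True when both texts assign the same level to the same conditions."""
    return parse_risk_levels(predicted) == parse_risk_levels(reference)


# Reference text abbreviated from the example record in the card.
reference = (
    "## Risk Assessment\n\n"
    "**Type 2 Diabetes: MODERATE**\n"
    "- HbA1c 6.0% (prediabetes range 5.7-6.4%)\n\n"
    "**Hypertension: HIGH**\n"
    "- BP 142/88 mmHg (Stage 1 hypertension)\n"
)

levels = parse_risk_levels(reference)
print(levels)
```

Comparing only the extracted levels ignores the supporting factors and recommendations, so this is a coarse check rather than a full evaluation of the generated text.
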
Provider:
samwell