xpertsystems/hc01-t2d-sample
Hugging Face · Updated 2026-04-18 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/xpertsystems/hc01-t2d-sample
Description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- healthcare
- diabetes
- synthetic-data
- ehr
- clinical
- type-2-diabetes
- tabular
- longitudinal
pretty_name: "HC01 — Synthetic Type 2 Diabetes Dataset (Sample)"
size_categories:
- 10K<n<100K
task_categories:
- tabular-classification
- tabular-regression
- time-series-forecasting
configs:
- config_name: patient_master
  data_files: data/patient_master.csv
- config_name: encounters
  data_files: data/patient_encounters.csv
- config_name: medications
  data_files: data/medication_orders.csv
- config_name: complications
  data_files: data/complications_registry.csv
- config_name: labs
  data_files: data/lab_results_longitudinal.csv
- config_name: population_summary
  data_files: data/population_summary.csv
---
# HC01 — Synthetic Type 2 Diabetes Patient Dataset (Evaluation Sample)
**Publisher:** [XpertSystems.ai](https://xpertsystems.ai)
**SKU:** HC01 (sample)
**Version:** 1.0.0
**License:** CC BY-NC 4.0 — non-commercial evaluation and research use only. Commercial use, redistribution, or derivative data products require a commercial license.
**Full product:** Contact pradeep@xpertsystems.ai
---
## What this is
A **500-patient evaluation slice** of the XpertSystems HC01 synthetic Type 2 Diabetes dataset, released for technical evaluation, academic research, and benchmarking. The full commercial product covers 25,000+ patients with complete statistical validation, ML feature packs, and a Grade A+ benchmark report.
This sample is intended to let ML engineers, data scientists, and health-economics researchers verify the statistical fidelity and schema quality of the data before evaluating the full product. It is **not** sized for model training at production scale — rare events, long tails, and cross-cohort signal are materially underrepresented at 500 patients.
## What's included
Six CSV files covering a 5-year patient journey:
| File | Rows (approx.) | Description |
|---|---|---|
| `patient_master.csv` | 500 | One row per patient. Demographics, SDOH risk, diagnosis date, baseline biomarkers, comorbidities, insurance, care-site assignment. |
| `patient_encounters.csv` | ~8,400 | Longitudinal encounters (office, telehealth, specialist, ED, inpatient, RPM, care management). Includes biomarkers at visit, ICD-10, provider, payer, copay, care-gap flags. |
| `medication_orders.csv` | ~6,200 | Prescription orders across 10 T2D drug classes (metformin, SGLT2, GLP-1, insulin, etc.). Includes MPR/PDC adherence, prior authorization outcomes, formulary tier, titration and discontinuation events. |
| `complications_registry.csv` | ~960 | Diabetic complications (nephropathy, retinopathy, neuropathy, CVD, amputation, etc.) with onset date, severity stage, referral and treatment flags. |
| `lab_results_longitudinal.csv` | ~19,600 | HbA1c, fasting glucose, lipid panel, UACR, eGFR, and screening labs. Includes critical-value flags, follow-up lag, duplicate-lab and care-gap anomaly flags. |
| `population_summary.csv` | ~600 | Care-site × quarter aggregates: panel size, utilization rates, population-level glycemic control, care-gap rates. |
**Not included in this sample:** the simulation engine, ML feature pack, statistical validation report (`metrics.json`), benchmark scoring artifacts, and the full-volume dataset.
## Quick start
```python
from datasets import load_dataset
# Load any of the six tables
patients = load_dataset("xpertsystems/hc01-t2d-sample", "patient_master")
encounters = load_dataset("xpertsystems/hc01-t2d-sample", "encounters")
labs = load_dataset("xpertsystems/hc01-t2d-sample", "labs")
print(patients["train"][0])
```
Or with pandas directly:
```python
import pandas as pd
from huggingface_hub import hf_hub_download
path = hf_hub_download(
    repo_id="xpertsystems/hc01-t2d-sample",
    filename="data/patient_master.csv",
    repo_type="dataset",
)
df = pd.read_csv(path)
```
## Schema highlights
**Entity keys:** `patient_id` (`PAT#######`) links all tables. `encounter_id`, `order_id`, `lab_id`, and `complication_id` are unique per row. `site_id` and `payer_id` link encounters to care sites and payers, respectively.
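A minimal sketch of a join-integrity check on `patient_id`, assuming each table has already been loaded into a pandas DataFrame. The helper name `orphan_counts` and the toy values are illustrative assumptions; only the `patient_id` and `encounter_id` column names come from the schema description above.

```python
import pandas as pd


def orphan_counts(master: pd.DataFrame, children: dict[str, pd.DataFrame],
                  key: str = "patient_id") -> dict[str, int]:
    """Count rows in each child table whose key is missing from the master table."""
    known = set(master[key])
    return {name: int((~df[key].isin(known)).sum()) for name, df in children.items()}


# Toy stand-ins for patient_master.csv and patient_encounters.csv (hypothetical values).
master = pd.DataFrame({"patient_id": ["PAT0000001", "PAT0000002"]})
encounters = pd.DataFrame({
    "encounter_id": ["E1", "E2", "E3"],
    "patient_id": ["PAT0000001", "PAT0000002", "PAT0000009"],  # last row is an orphan
})

result = orphan_counts(master, {"encounters": encounters})
print(result)  # {'encounters': 1}
assert encounters["encounter_id"].is_unique  # per-row IDs should never repeat
```

On the real sample, every child table would be expected to report zero orphans.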
**Temporal structure:** A 5-year simulated observation window. Quarterly patient-state updates drive encounter, lab, and medication timing. Dates are ISO-format (`YYYY-MM-DD`).
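Because the simulation advances in quarters, it can help to map each date onto a quarter index before aggregating. The sketch below assumes a date column named `result_date`, which is a hypothetical name; check the actual CSV headers before reusing it.

```python
import pandas as pd

# Hypothetical column name and values; the real tables may name date fields differently.
labs = pd.DataFrame({"result_date": ["2021-01-15", "2021-04-02", "2022-12-31"]})

# Strict ISO parsing: a malformed date raises instead of silently becoming NaT.
labs["result_date"] = pd.to_datetime(labs["result_date"], format="%Y-%m-%d")

# Absolute quarter number (year * 4 + quarter offset), then shift so the earliest is 0.
abs_quarter = labs["result_date"].dt.year * 4 + (labs["result_date"].dt.quarter - 1)
labs["quarter_idx"] = abs_quarter - abs_quarter.min()

print(labs["quarter_idx"].tolist())  # [0, 1, 7]
```

Over a 5-year window the index would span at most 20 distinct quarters, assuming the window starts on a quarter boundary.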
**Coding standards:** ICD-10-CM for diagnoses and complications; RxNorm-style codes for medications (representative, not authoritative); LOINC-aligned lab types.
**Realism controls present in this sample:**
- Anomaly flags on labs, encounters, and medication orders for data-quality testing
- Duplicate-lab and care-gap anomalies at calibrated base rates
- Prior-authorization denial cascades affecting adherence
- Coverage disruption events with downstream adherence penalties
- Death flags with dates (where applicable)
- SDOH-driven adherence heterogeneity
## How this was generated
HC01 is produced by a deterministic simulation engine that models a synthetic T2D patient population through a calibrated sequence of stochastic processes:
- **Population:** Patient demographics, comorbidities, and social-determinant risk are sampled from distributions aligned to public U.S. population references (CDC National Diabetes Statistics Report, NHANES, HEDIS MY2023).
- **Glycemic trajectory:** Each patient's HbA1c trajectory is modeled as a mean-reverting stochastic process conditioned on adherence, treatment intensification, and seasonal variation.
- **Adherence:** Medication adherence (MPR/PDC) is drawn from a Beta distribution and modified by prior-authorization outcomes, copay burden, and coverage disruptions.
- **Complications:** Complication incidence follows a proportional-hazards formulation with hazard ratios for HbA1c, disease duration, CKD stage, and blood pressure, calibrated to published rates.
- **Event streams:** Encounter, lab, and medication-order streams are generated conditional on patient state at each quarter, with ordering rates aligned to HEDIS Comprehensive Diabetes Care benchmarks.
- **Anomalies:** A small, controlled fraction of anomalies (duplicate labs, implausible values, care gaps) is injected to support data-quality and anomaly-detection use cases.
All simulation is deterministic under a fixed integer seed. The full commercial product ships with a 12-metric benchmark validation report certifying fidelity to published clinical and utilization targets (Grade A+ at default parameters).
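To make the description above more concrete, here is a toy sketch of two of its ingredients: Beta-distributed adherence and a mean-reverting quarterly HbA1c path under a fixed seed. The function name, every constant, and the specific update rule are assumptions chosen for illustration. This is not the publisher's simulation engine and will not reproduce the dataset.

```python
import numpy as np


def simulate_hba1c(n_quarters: int = 20, seed: int = 0) -> tuple[float, np.ndarray]:
    """Toy mean-reverting HbA1c path; all constants are illustrative assumptions."""
    rng = np.random.default_rng(seed)   # fixed seed -> reproducible output
    adherence = rng.beta(5.0, 2.0)      # adherence proportion in (0, 1)
    target = 9.0 - 2.0 * adherence      # better adherence -> lower long-run HbA1c
    theta, sigma = 0.3, 0.25            # reversion speed and quarterly noise scale
    path = np.empty(n_quarters)
    path[0] = 8.2                       # baseline mean quoted in this card
    for t in range(1, n_quarters):
        seasonal = 0.1 * np.cos(2 * np.pi * t / 4)   # yearly cycle over 4 quarters
        drift = theta * (target - path[t - 1])       # pull toward the target
        path[t] = path[t - 1] + drift + seasonal + sigma * rng.normal()
    return adherence, path


adherence, path = simulate_hba1c(seed=42)
_, again = simulate_hba1c(seed=42)
assert np.array_equal(path, again)  # same seed, same trajectory
print(round(adherence, 2), path[:4].round(2))
```

The default of 20 quarters mirrors the 5-year window; running the function with two different adherence levels is a quick way to see how a mean-reverting process separates adherent from non-adherent cohorts.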
## Methodology references
- American Diabetes Association — Standards of Care 2024
- CDC — National Diabetes Statistics Report 2022
- UKPDS Outcomes Model 2 (Hayes et al., 2013)
- NCQA HEDIS MY2023 — Comprehensive Diabetes Care
- HCUP — Nationwide Emergency Department Sample (NEDS) / National Inpatient Sample (NIS)
- AHIP — Prior Authorization Survey 2023
- Nathan et al. (2008) — Estimated Average Glucose equation
- CAP Q-Probes — Critical lab notification study
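For reference, the linear relationship reported by Nathan et al. (2008) is `eAG (mg/dL) = 28.7 * HbA1c (%) - 46.7`. The helper below applies it; the function name is ours, and how the simulation engine uses the equation internally is not documented in this card.

```python
def estimated_average_glucose(hba1c_percent: float) -> float:
    """eAG in mg/dL from HbA1c in percent (Nathan et al., 2008 linear fit)."""
    return 28.7 * hba1c_percent - 46.7


eag = estimated_average_glucose(7.0)
print(round(eag, 1))  # 154.2
```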
## Suggested evaluation workflow
1. **Schema & volume sanity check.** Load all six CSVs, confirm row counts and join integrity on `patient_id`.
2. **Distribution checks.** Verify baseline HbA1c mean (~8.2%), BMI distribution (~33.2 kg/m² mean), comorbidity prevalences, and insurance mix against the references above.
3. **Correlation checks.** HbA1c–BMI correlation (~0.28), HbA1c–complication incidence monotonicity, adherence–outcome relationships.
4. **Longitudinal behavior.** Plot individual HbA1c trajectories; verify mean reversion, seasonal component, and separation between adherent and non-adherent cohorts.
5. **Edge-case coverage.** Review anomaly flags, critical-lab follow-up patterns, prior-auth denial cascades.
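Steps 2 and 3 can be sketched as follows. The column names `baseline_hba1c` and `bmi`, the 10% tolerance, and the toy data generator are assumptions for illustration; substitute the real `patient_master.csv` columns once you have inspected them.

```python
import numpy as np
import pandas as pd


def check_targets(df: pd.DataFrame, targets: dict[str, float], tol: float) -> dict[str, bool]:
    """Per column: is the sample mean within `tol` (relative) of the target?"""
    return {col: bool(abs(df[col].mean() - t) <= tol * abs(t)) for col, t in targets.items()}


# Toy stand-in for patient_master.csv; column names are hypothetical.
rng = np.random.default_rng(0)
n = 500
bmi = rng.normal(33.2, 6.0, n)
hba1c = 8.2 + 0.07 * (bmi - 33.2) + rng.normal(0.0, 1.4, n)
patients = pd.DataFrame({"baseline_hba1c": hba1c, "bmi": bmi})

# Step 2: sample means against the targets quoted in the workflow.
checks = check_targets(patients, {"baseline_hba1c": 8.2, "bmi": 33.2}, tol=0.10)
print(checks)

# Step 3: Pearson correlation between baseline HbA1c and BMI.
r = patients["baseline_hba1c"].corr(patients["bmi"])
print(round(r, 2))
```

At 500 patients the standard error of a correlation near 0.28 is roughly 0.04, so a sample value somewhat above or below the quoted figure is not by itself evidence of a fidelity problem.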
If the sample passes your evaluation, the full product (25,000+ patients, ML feature pack, and Grade A+ validation report) is available under commercial license.
## Citation
If you use this sample in research or publication, please cite:
> XpertSystems.ai (2026). *HC01 — Synthetic Type 2 Diabetes Patient Dataset (Evaluation Sample), v1.0.0.* https://xpertsystems.ai
## Contact
- Commercial licensing / full product: **pradeep@xpertsystems.ai**
- Technical questions: **pradeep@xpertsystems.ai**
- Web: **https://xpertsystems.ai**
## License
This sample is released under **Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0)**. You may use, share, and adapt the data for non-commercial research and evaluation purposes with attribution. Commercial use, redistribution as a data product, or inclusion in a commercial offering requires a separate commercial license from XpertSystems.ai.
All records are **fully synthetic**. No real patient data, PHI, or PII is present. Not intended for clinical use.
Provider:
xpertsystems



