xpertsystems/hc01-t2d-sample

Source: Hugging Face. Updated 2026-04-18. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/xpertsystems/hc01-t2d-sample
Resource description:
---
license: cc-by-nc-4.0
language:
- en
tags:
- healthcare
- diabetes
- synthetic-data
- ehr
- clinical
- type-2-diabetes
- tabular
- longitudinal
pretty_name: "HC01: Synthetic Type 2 Diabetes Dataset (Sample)"
size_categories:
- 10K<n<100K
task_categories:
- tabular-classification
- tabular-regression
- time-series-forecasting
configs:
- config_name: patient_master
  data_files: data/patient_master.csv
- config_name: encounters
  data_files: data/patient_encounters.csv
- config_name: medications
  data_files: data/medication_orders.csv
- config_name: complications
  data_files: data/complications_registry.csv
- config_name: labs
  data_files: data/lab_results_longitudinal.csv
- config_name: population_summary
  data_files: data/population_summary.csv
---

# HC01: Synthetic Type 2 Diabetes Patient Dataset (Evaluation Sample)

- **Publisher:** [XpertSystems.ai](https://xpertsystems.ai)
- **SKU:** HC01 (sample)
- **Version:** 1.0.0
- **License:** CC BY-NC 4.0 (non-commercial evaluation and research use only). Commercial use, redistribution, or derivative data products require a commercial license.
- **Full product:** Contact pradeep@xpertsystems.ai

---

## What this is

A **500-patient evaluation slice** of the XpertSystems HC01 synthetic Type 2 Diabetes dataset, released for technical evaluation, academic research, and benchmarking. The full commercial product covers 25,000+ patients with complete statistical validation, ML feature packs, and a Grade A+ benchmark report.

This sample is intended to let ML engineers, data scientists, and health-economics researchers verify the statistical fidelity and schema quality of the data before evaluating the full product. It is **not** sized for model training at production scale: rare events, long tails, and cross-cohort signal are materially underrepresented at 500 patients.

## What's included

Six CSV files covering a 5-year patient journey:

| File | Rows (approx.) | Description |
|---|---|---|
| `patient_master.csv` | 500 | One row per patient. Demographics, SDOH risk, diagnosis date, baseline biomarkers, comorbidities, insurance, care-site assignment. |
| `patient_encounters.csv` | ~8,400 | Longitudinal encounters (office, telehealth, specialist, ED, inpatient, RPM, care management). Includes biomarkers at visit, ICD-10, provider, payer, copay, care-gap flags. |
| `medication_orders.csv` | ~6,200 | Prescription orders across 10 T2D drug classes (metformin, SGLT2, GLP-1, insulin, etc.). Includes MPR/PDC adherence, prior authorization outcomes, formulary tier, titration and discontinuation events. |
| `complications_registry.csv` | ~960 | Diabetic complications (nephropathy, retinopathy, neuropathy, CVD, amputation, etc.) with onset date, severity stage, referral and treatment flags. |
| `lab_results_longitudinal.csv` | ~19,600 | HbA1c, fasting glucose, lipid panel, UACR, eGFR, and screening labs. Includes critical-value flags, follow-up lag, duplicate-lab and care-gap anomaly flags. |
| `population_summary.csv` | ~600 | Care-site × quarter aggregates: panel size, utilization rates, population-level glycemic control, care-gap rates. |

**Not included in this sample:** the simulation engine, ML feature pack, statistical validation report (`metrics.json`), benchmark scoring artifacts, and the full-volume dataset.

## Quick start

```python
from datasets import load_dataset

# Load any of the six tables
patients = load_dataset("xpertsystems/hc01-t2d-sample", "patient_master")
encounters = load_dataset("xpertsystems/hc01-t2d-sample", "encounters")
labs = load_dataset("xpertsystems/hc01-t2d-sample", "labs")

print(patients["train"][0])
```

Or with pandas directly:

```python
import pandas as pd
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="xpertsystems/hc01-t2d-sample",
    filename="data/patient_master.csv",
    repo_type="dataset",
)
df = pd.read_csv(path)
```

## Schema highlights

**Entity keys:** `patient_id` (`PAT#######`) links all tables.
`encounter_id`, `order_id`, `lab_id`, `complication_id` are unique per row. `site_id` and `payer_id` link encounters to care sites and payers respectively.

**Temporal structure:** A 5-year simulated observation window. Quarterly patient-state updates drive encounter, lab, and medication timing. Dates are ISO-format (`YYYY-MM-DD`).

**Coding standards:** ICD-10-CM for diagnoses and complications; RxNorm-style codes for medications (representative, not authoritative); LOINC-aligned lab types.

**Realism controls present in this sample:**

- Anomaly flags on labs, encounters, and medication orders for data-quality testing
- Duplicate-lab and care-gap anomalies at calibrated base rates
- Prior-authorization denial cascades affecting adherence
- Coverage disruption events with downstream adherence penalties
- Death flags with dates (where applicable)
- SDOH-driven adherence heterogeneity

## How this was generated

HC01 is produced by a deterministic simulation engine that models a synthetic T2D patient population through a calibrated sequence of stochastic processes. Patient demographics, comorbidities, and social-determinant risk are sampled from distributions aligned to public U.S. population references (CDC National Diabetes Statistics Report, NHANES, HEDIS MY2023).

Each patient's HbA1c trajectory is modeled as a mean-reverting stochastic process conditioned on adherence, treatment intensification, and seasonal variation. Medication adherence (MPR/PDC) is drawn from a Beta distribution and modified by prior-authorization outcomes, copay burden, and coverage disruptions. Complication incidence follows a proportional-hazards formulation with hazard ratios for HbA1c, disease duration, CKD stage, and blood pressure, calibrated to published rates.

Encounter, lab, and medication-order streams are generated conditional on patient state at each quarter, with ordering rates aligned to HEDIS Comprehensive Diabetes Care benchmarks.
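
To make the idea of a mean-reverting HbA1c process conditioned on adherence and seasonality more concrete, here is a minimal sketch of one possible discrete-time form. It is not the HC01 engine: the function name `simulate_hba1c`, the update rule, and every parameter value (targets, reversion speed, seasonal amplitude, noise level) are hypothetical placeholders chosen for illustration. Only the 8.2% starting value and the 20 quarters (5 years) come from this card.

```python
import math
import random


def simulate_hba1c(baseline, adherence, quarters=20, seed=0,
                   target_adherent=7.0, target_nonadherent=9.0,
                   reversion=0.3, seasonal_amp=0.1, noise_sd=0.2):
    """Quarterly HbA1c path from a simple mean-reverting update.

    All parameter defaults are hypothetical, not HC01's calibration.
    """
    rng = random.Random(seed)  # fixed integer seed -> reproducible path
    # Long-run level slides from the non-adherent target toward the
    # adherent target as adherence goes from 0 to 1.
    long_run = target_nonadherent - adherence * (target_nonadherent - target_adherent)
    path = [baseline]
    for q in range(1, quarters + 1):
        pull = reversion * (long_run - path[-1])                 # drift toward long-run level
        seasonal = seasonal_amp * math.cos(2 * math.pi * q / 4)  # one cycle per 4 quarters
        shock = rng.gauss(0.0, noise_sd)                         # random quarterly noise
        path.append(path[-1] + pull + seasonal + shock)
    return path


# Same seed, different adherence: the two cohorts share the same shocks,
# so any separation between the paths comes from adherence alone.
adherent = simulate_hba1c(8.2, adherence=0.9, seed=42)
non_adherent = simulate_hba1c(8.2, adherence=0.3, seed=42)
print(round(adherent[-1], 2), round(non_adherent[-1], 2))
```

Reusing one seed for both cohorts is a deliberate choice here: it isolates the adherence effect, which is the kind of separation the longitudinal checks on this sample look for.
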
A small, controlled fraction of anomalies (duplicate labs, implausible values, care gaps) is injected to support data-quality and anomaly-detection use cases.

All simulation is deterministic under a fixed integer seed. The full commercial product ships with a 12-metric benchmark validation report certifying fidelity to published clinical and utilization targets (Grade A+ at default parameters).

## Methodology references

- American Diabetes Association: Standards of Care 2024
- CDC: National Diabetes Statistics Report 2022
- UKPDS Outcomes Model 2 (Hayes et al., 2013)
- NCQA HEDIS MY2023: Comprehensive Diabetes Care
- HCUP: National ED Survey / National Inpatient Sample
- AHIP: Prior Authorization Survey 2023
- Nathan et al. (2008): Estimated Average Glucose equation
- CAP Q-Probes: Critical lab notification study

## Suggested evaluation workflow

1. **Schema & volume sanity check.** Load all six CSVs, confirm row counts and join integrity on `patient_id`.
2. **Distribution checks.** Verify baseline HbA1c mean (~8.2%), BMI distribution (~33.2 kg/m² mean), comorbidity prevalences, and insurance mix against the references above.
3. **Correlation checks.** HbA1c–BMI correlation (~0.28), HbA1c–complication incidence monotonicity, adherence–outcome relationships.
4. **Longitudinal behavior.** Plot individual HbA1c trajectories; verify mean reversion, seasonal component, and separation between adherent and non-adherent cohorts.
5. **Edge-case coverage.** Review anomaly flags, critical-lab follow-up patterns, prior-auth denial cascades.

If the sample passes your evaluation, the full 25,000-patient product (plus ML feature pack and Grade A+ validation report) is available under commercial license.

## Citation

If you use this sample in research or publication, please cite:

> XpertSystems.ai (2026).
*HC01: Synthetic Type 2 Diabetes Patient Dataset (Evaluation Sample), v1.0.0.* https://xpertsystems.ai

## Contact

- Commercial licensing / full product: **pradeep@xpertsystems.ai**
- Technical questions: **pradeep@xpertsystems.ai**
- Web: **https://xpertsystems.ai**

## License

This sample is released under **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)**. You may use, share, and adapt the data for non-commercial research and evaluation purposes with attribution. Commercial use, redistribution as a data product, or inclusion in a commercial offering requires a separate commercial license from XpertSystems.ai.

All records are **fully synthetic**. No real patient data, PHI, or PII is present. Not intended for clinical use.
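
The join-integrity check on `patient_id` from the suggested evaluation workflow can be sketched with the standard library alone. The helper name `orphan_keys` is hypothetical, and the rows below (including the `E1`/`E2` encounter identifiers) are made up for the demonstration rather than taken from the dataset; only the `patient_id` column name and its `PAT#######` pattern come from this card.

```python
import csv
import io


def orphan_keys(master_csv, child_csv, key="patient_id"):
    """Return sorted key values present in the child table but absent from the master."""
    master_ids = {row[key] for row in csv.DictReader(master_csv)}
    child_ids = {row[key] for row in csv.DictReader(child_csv)}
    return sorted(child_ids - master_ids)


# Tiny in-memory stand-ins for patient_master.csv and patient_encounters.csv.
master = io.StringIO("patient_id\nPAT0000001\nPAT0000002\n")
encounters = io.StringIO(
    "encounter_id,patient_id\n"
    "E1,PAT0000001\n"
    "E2,PAT0000003\n"  # deliberately references a patient missing from master
)

orphans = orphan_keys(master, encounters)
print(orphans)  # ['PAT0000003']
```

With the real files, the same function accepts open file handles, for example `open(path, newline="")` on a downloaded CSV. An empty result for each of the child tables would indicate that every row joins back to `patient_master.csv`.
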
Provider:
xpertsystems