PRIME-CVD Data Asset 1: DAG-Simulated Cardiovascular Risk Cohort for Medical Informatics Education
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/PRIME-CVD_Data_Asset_1_DAG-Simulated_Cardiovascular_Risk_Cohort_for_Medical_Informatics_Education/31395765
Description:
Overview
PRIME-CVD Data Asset 1 is a directed acyclic graph (DAG)-simulated cohort of 50,000 synthetic individuals designed to reproduce realistic demographic structure, socioeconomic gradients, cardiometabolic risk factor distributions, and clinically plausible five-year cardiovascular disease (CVD) incidence consistent with contemporary Australian primary prevention populations [1]. The dataset encodes established epidemiologic relationships among age, socioeconomic disadvantage (IRSD), behavioural risk factors, chronic disease states, biomarkers, and time-to-event outcomes within a transparent, parametrically specified causal framework.
Security
All individuals are simulated entirely de novo using a fully parameterised DAG configured from publicly available epidemiologic summaries (e.g., Australian Institute of Health and Welfare, Australian Bureau of Statistics, and peer-reviewed literature). No patient-level electronic medical record (EMR) data were used in model construction, and no machine learning generative models (e.g., GANs, diffusion models, or large language models) were trained on real clinical data.
Because the simulation is entirely mechanism-based rather than data-trained, there is no membership inference risk, no residual linkage risk, and no possibility of re-identification. None of the synthetic individuals correspond to real-world patients, and the dataset contains no direct identifiers, quasi-identifiers, or protected health information.
Educational Focus
Despite being fully simulated, the dataset preserves realistic subgroup imbalance and clinically meaningful risk gradients, enabling applied training in epidemiology, medical informatics, and health data science without governance barriers.
PRIME-CVD Data Asset 1 is suitable for instruction in:
Cox proportional hazards modelling (survival analysis in epidemiology)
Risk prediction model development and calibration assessment
Classification metrics (precision, recall, F1 score) for statistical interpretation
Dimensionality reduction techniques (e.g., t-SNE) for data visualisation
Demographic and socioeconomic stratification for health policy analysis
Fairness-aware modelling and subgroup performance evaluation
This environment allows learners to develop analytic workflows and methodological competence prior to working with governed clinical datasets.
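As an illustration of the classification metrics named above, the following is a minimal sketch of how precision, recall, and F1 score could be computed for a binary 5-year CVD outcome. The labels in `y_true` and `y_pred` are toy values invented for this example, not records from the dataset, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = event, 0 = no event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against division by zero when a class is never predicted or never observed.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Toy labels (assumed for illustration only): 1 = CVD event within five years.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With an overall event rate near 4%, a classifier that predicts "no event" for everyone would reach high accuracy but zero recall, which is why these metrics are more informative than accuracy for this kind of cohort.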
Reference
[1] Kuo NI-H, et al. Estimating 5-year absolute risk of cardiovascular disease using routinely collected electronic medical records from Australian general practices. Heart. 2025.
Synthetic Cohort Characteristics (N = 50,000)
Age (years)
Mean (SD): 49.71 (12.37)
Median [IQR]: 49.63 [41.33, 58.09]
Range: 18.0–90.0
IRSD Quintile Distribution
Q1: 21.28% (Most disadvantaged)
Q2: 16.11%
Q3: 23.88%
Q4: 16.99%
Q5: 21.74% (Least disadvantaged)
Smoking Status
Non-smoker: 73.14%
Ex-smoker: 16.72%
Current smoker: 10.13%
Chronic Disease Prevalence
Diabetes mellitus: 7.43%
Chronic Kidney Disease (CKD): 0.680%
Atrial Fibrillation (AF): 0.720%
Body Mass Index (BMI, kg/m²)
Mean (SD): 28.33 (5.03)
Median [IQR]: 28.33 [24.92, 31.73]
Range: 15.0–52.76
Systolic Blood Pressure (SBP, mmHg)
Mean (SD): 123.31 (16.10)
Median [IQR]: 123.14 [112.39, 134.10]
Range: 55.85–187.79
Estimated Glomerular Filtration Rate (eGFR, mL/min/1.73m²)
Mean (SD): 82.77 (6.09)
Median [IQR]: 82.94 [79.22, 86.66]
Range: 37.00–104.65
Haemoglobin A1c (HbA1c, %)
Mean (SD): 4.79 (0.93)
Median [IQR]: 4.66 [4.24, 5.12]
Range: 2.23–12.71
Cardiovascular Outcomes
Overall 5-year CVD event rate: 4.02%
Mean follow-up time: 4.80 years
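The two outcome figures above can be combined into a rough crude incidence rate. The sketch below is back-of-envelope arithmetic using only the rounded summary values reported here; the variable names are illustrative, and the results are approximations rather than values computed from the individual-level records.

```python
# Summary values as reported in the cohort characteristics (rounded).
n = 50_000                    # cohort size
event_rate = 0.0402           # overall 5-year CVD event rate (4.02%)
mean_follow_up_years = 4.80   # mean follow-up time in years

# Approximate event count implied by the reported proportion.
approx_events = round(n * event_rate)

# Total person-years equals cohort size times mean follow-up.
approx_person_years = n * mean_follow_up_years

# Crude incidence rate per 1,000 person-years.
crude_rate_per_1000_py = 1000 * approx_events / approx_person_years

print(f"events ~ {approx_events}")
print(f"person-years ~ {approx_person_years:,.0f}")
print(f"crude rate ~ {crude_rate_per_1000_py:.2f} per 1,000 person-years")
```

Under these rounded inputs the cohort contains about 2,010 events over about 240,000 person-years, or roughly 8.4 events per 1,000 person-years.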
Created:
2026-02-23



