PRIME-CVD Data Asset 2: Relational EMR-Style Cardiovascular Dataset for Medical Informatics Education

Figshare | Updated 2026-02-24 | Indexed 2026-04-28

Download link:

https://figshare.com/articles/dataset/PRIME-CVD_Data_Asset_2_Relational_EMR-Style_Cardiovascular_Dataset_for_Medical_Informatics_Education/31403028

Resource description:

Overview

PRIME-CVD Data Asset 2 is an EMR-style relational dataset derived deterministically from PRIME-CVD Data Asset 1 [1]. It represents a cohort of 50,000 simulated adults undergoing primary prevention for cardiovascular disease (CVD), but restructures the clean cohort into a multi-table patient database designed to mimic the structural and lexical messiness of real general-practice EMR systems.

Rather than providing a single tidy modelling table, Data Asset 2 forces realistic workflows: linkage, harmonisation, unit handling, and cohort reconstruction, which are exactly the skills learners need before they touch governed clinical datasets.

This asset supports teaching and assessment in health data engineering, medical informatics, and applied epidemiology, without privacy or governance barriers.

What's in the ZIP

The zip contains three CSV tables, linked via Patient_ID:

1. PatientMasterSummary.csv: one row per patient, covering demographics, socioeconomic status (IRSD), smoking status (with injected missingness), and coarsened CVD outcome timing.
2. PatientChronicDiseases.csv: one-to-many diagnosis records for diabetes, CKD, and AF, represented using heterogeneous free-text and code-like labels.
3. PatientMeasAndPath.csv: long-form measurements (e.g., BMI / SBP / eGFR / HbA1c) with inconsistent variable naming and mixed units.

Note: All three tables include an Unnamed: 0 column (a benign saved index). Users can safely drop it.

Security

All individuals are simulated entirely de novo using a fully parameterised DAG configured from publicly available epidemiologic summaries (e.g., Australian Institute of Health and Welfare, Australian Bureau of Statistics, and peer-reviewed literature). No patient-level electronic medical record (EMR) data were used in model construction, and no machine learning generative models (e.g., GANs, diffusion models, or large language models) were trained on real clinical data.

Because the simulation is entirely mechanism-based rather than data-trained, there is no membership inference risk, no residual linkage risk, and no possibility of re-identification. None of the synthetic individuals correspond to real-world patients, and the dataset contains no direct identifiers, quasi-identifiers, or protected health information.

Educational Focus

Data Asset 2 is designed to teach real-world EMR analytics, including:

- Relational data linkage using patient identifiers
- Cohort reconstruction from multi-table EMR structure
- Clinical label harmonisation
- Handling missingness
- Unit standardisation
- Feature engineering from long-form measurement tables
- Robust subgroup / equity-aware analysis by IRSD
- End-to-end pipelines that resemble real primary-care workflows

This lets learners develop competence in realistic EMR workflows before transitioning to governed clinical data.

Reference

[1] Kuo NI-H (2026). PRIME-CVD Data Asset 1: DAG-Simulated Cardiovascular Risk Cohort for Medical Informatics Education. figshare. Dataset. https://doi.org/10.6084/m9.figshare.31395765.v1
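
Example: linking the three tables

The linkage step described above (drop the saved index column, then join the one-to-many tables onto the one-row-per-patient table via Patient_ID) might look roughly like the following. This is a minimal, dependency-free sketch rather than the dataset author's pipeline. Only Patient_ID and Unnamed: 0 are taken from the description; the other column names (Age, Condition, Measure, Value), the patient IDs, and the inline toy CSV text are hypothetical stand-ins for the real files.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the three CSV files. Only Patient_ID and "Unnamed: 0"
# come from the dataset description; every other column name is hypothetical.
MASTER_CSV = """Unnamed: 0,Patient_ID,Age
0,P001,54
1,P002,61
"""
DISEASES_CSV = """Unnamed: 0,Patient_ID,Condition
0,P001,T2DM
1,P001,atrial fibrillation
"""
MEAS_CSV = """Unnamed: 0,Patient_ID,Measure,Value
0,P001,SBP,142
1,P002,sbp,128
2,P002,BMI,27.4
"""


def read_table(text):
    """Parse CSV text into a list of dict rows, dropping the saved index column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row.pop("Unnamed: 0", None)  # benign saved index, safe to drop
        rows.append(row)
    return rows


def group_by_patient(rows):
    """Collect the rows of a one-to-many table under each Patient_ID."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Patient_ID"]].append(row)
    return grouped


def link_tables(master, diseases, measurements):
    """Left-join the one-to-many tables onto the one-row-per-patient table."""
    by_dx = group_by_patient(diseases)
    by_meas = group_by_patient(measurements)
    linked = {}
    for patient in master:
        pid = patient["Patient_ID"]
        if pid in linked:
            # The master table is described as one row per patient.
            raise ValueError(f"duplicate Patient_ID in master table: {pid}")
        linked[pid] = {
            "master": patient,
            "diagnoses": by_dx.get(pid, []),
            "measurements": by_meas.get(pid, []),
        }
    return linked


linked = link_tables(
    read_table(MASTER_CSV), read_table(DISEASES_CSV), read_table(MEAS_CSV)
)
for pid, record in linked.items():
    print(pid, len(record["diagnoses"]), len(record["measurements"]))
```

Keeping the join as a left join from the master table means patients with no recorded diagnoses or measurements stay in the cohort with empty lists, which matters when reconstructing denominators. With the real files, read_table would be given the file contents instead of the inline strings.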

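
Example: label harmonisation and unit standardisation

The description mentions heterogeneous diagnosis labels and mixed measurement units but does not list them, so the following is only a sketch under stated assumptions. The label variants in CONDITION_MAP are hypothetical and would need to be replaced after inspecting PatientChronicDiseases.csv. The assumption that HbA1c appears in both percent and mmol/mol is likewise illustrative. The conversion itself uses the published IFCC-NGSP master equation.

```python
# Hypothetical label variants; the real spellings in PatientChronicDiseases.csv
# are not listed in the description and must be discovered by inspection.
CONDITION_MAP = {
    "t2dm": "diabetes",
    "type 2 diabetes": "diabetes",
    "e11": "diabetes",  # ICD-10 code for type 2 diabetes mellitus
    "ckd": "ckd",
    "chronic kidney disease": "ckd",
    "n18": "ckd",  # ICD-10 code for chronic kidney disease
    "af": "af",
    "atrial fibrillation": "af",
    "i48": "af",  # ICD-10 code for atrial fibrillation and flutter
}


def harmonise_condition(label):
    """Map a raw diagnosis label to a canonical name, or None if unrecognised."""
    return CONDITION_MAP.get(label.strip().lower())


def diagnosis_flags(raw_labels):
    """Turn one patient's raw labels into boolean flags plus unmapped leftovers."""
    flags = {"diabetes": False, "ckd": False, "af": False}
    unmapped = []
    for label in raw_labels:
        canonical = harmonise_condition(label)
        if canonical is None:
            unmapped.append(label)  # keep for manual review, do not guess
        else:
            flags[canonical] = True
    return flags, unmapped


def hba1c_to_percent(value, unit):
    """Convert an HbA1c reading to NGSP percent.

    Uses the IFCC-NGSP master equation: percent = 0.09148 * mmol/mol + 2.152.
    Assumes the only units present are "%" and "mmol/mol" (an assumption).
    """
    unit = unit.strip().lower()
    if unit == "%":
        return value
    if unit == "mmol/mol":
        return 0.09148 * value + 2.152
    raise ValueError(f"unrecognised HbA1c unit: {unit!r}")


flags, unmapped = diagnosis_flags(
    ["T2DM", " Atrial Fibrillation ", "E11", "hypertension?"]
)
print(flags, unmapped)
print(round(hba1c_to_percent(48, "mmol/mol"), 1))
```

Returning unmapped labels separately, and raising on unknown units, makes gaps in the mapping visible instead of silently dropping records.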

Created:

2026-02-24
