Unlocking Retrospective Prevalent Information in EHRs—A Revisit to the Pairwise Pseudolikelihood

Name: Unlocking Retrospective Prevalent Information in EHRs—A Revisit to the Pairwise Pseudolikelihood
Creator: Taylor & Francis
Published: 2025-01-03 16:40:40
License: No description available

DataCite Commons. Updated: 2025-01-03. Indexed: 2025-05-07.

Download link:

https://tandf.figshare.com/articles/dataset/Unlocking_Retrospective_Prevalent_Information_in_EHRs_-_a_Revisit_to_the_Pairwise_Pseudolikelihood/27702574/2

Description:

Electronic health records offer abundant data on various diseases and health conditions, enabling researchers to explore the relationship between disease onset age and underlying risk factors. Unlike mortality data, the event of interest is nonterminal; hence, individuals can retrospectively report their disease onset age upon recruitment to the study. These individuals, diagnosed with the disease before entering the study, are termed “prevalent.” The ascertainment imposes a left truncation condition, also known as “delayed entry,” because individuals had to survive a certain period before being eligible for enrollment. The standard method for accommodating delayed entry conditions on the entire history up to recruitment; hence, the retrospective prevalent failure times are conditioned upon and cannot participate in estimating the disease-onset-age distribution. Other methods that condition on less information and allow the incorporation of the prevalent observations either bring about numerical and computational difficulties or require statistical assumptions that are violated by most biobanks. This work presents a novel estimator of the coefficients in a regression model for the age at onset, successfully using the prevalent data. Asymptotic results are provided, and simulations are conducted to showcase the substantial efficiency gain. In particular, the method is highly useful in leveraging large-scale repositories for replication analysis of genetic variants. Indeed, analysis of urinary bladder cancer data reveals that the proposed approach yields about twice as many replicated discoveries as the popular approach. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

Provider:

Taylor & Francis

Created:

2025-01-03