Fast Bayesian Record Linkage for Streaming Data Contexts
DataCite Commons | Updated 2024-01-03 | Indexed 2024-08-18
Download link:
https://tandf.figshare.com/articles/dataset/Fast_Bayesian_Record_Linkage_for_Streaming_Data_Contexts/24565758
Description:
Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective with estimates calculated from posterior samples of parameters and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this article, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of prior distribution on the resulting linkage accuracy as well as the computational tradeoffs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Supplementary materials for this article are available online.
Provider:
Taylor & Francis
Created:
2023-11-15



