
Fast Bayesian Record Linkage for Streaming Data Contexts

NIAID Data Ecosystem, indexed 2026-05-01
Download link:
https://figshare.com/articles/dataset/Fast_Bayesian_Record_Linkage_for_Streaming_Data_Contexts/24565758
Description:
Record linkage is the task of combining records from multiple files which refer to overlapping sets of entities when there is no unique identifying field. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. This problem arises in settings such as longitudinal surveys, electronic health records, and online events databases, among others. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective with estimates calculated from posterior samples of parameters and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this article, we generalize a two-file Bayesian Fellegi-Sunter model to the multi-file case and propose two methods to perform streaming updates. We examine the effect of prior distribution on the resulting linkage accuracy as well as the computational tradeoffs between the methods when compared to a Gibbs sampler through simulated and real-world survey panel data. We achieve near-equivalent posterior inference at a small fraction of the compute time. Supplementary materials for this article are available online.
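As a rough illustration of the Fellegi-Sunter idea that the description builds on, the sketch below scores record pairs between an existing file and a newly arrived file using per-field log-likelihood ratios. It is not the article's Bayesian multi-file model or either of its streaming update methods: the field names, the fixed m and u probabilities, and the helper names `comparison_vector`, `match_weight`, and `score_new_file` are all assumptions made for this example.

```python
import math

# Hypothetical per-field probabilities (illustrative assumptions, not from the article):
# m = P(field agrees | the two records refer to the same entity)
# u = P(field agrees | the two records refer to different entities)
M_PROBS = {"first_name": 0.95, "last_name": 0.97, "birth_year": 0.99}
U_PROBS = {"first_name": 0.05, "last_name": 0.02, "birth_year": 0.10}
FIELDS = list(M_PROBS)


def comparison_vector(rec_a, rec_b, fields):
    """Return a 0/1 agreement indicator for each compared field."""
    return {f: int(rec_a[f] == rec_b[f]) for f in fields}


def match_weight(gamma, m_probs, u_probs):
    """Sum the per-field log-likelihood ratios for a comparison vector."""
    total = 0.0
    for field, agrees in gamma.items():
        if agrees:
            total += math.log(m_probs[field] / u_probs[field])
        else:
            total += math.log((1 - m_probs[field]) / (1 - u_probs[field]))
    return total


def score_new_file(existing, new_file, fields, m_probs, u_probs):
    """For each record in the new file, return (best weight, index of best existing record)."""
    best = {}
    for j, new_rec in enumerate(new_file):
        scores = [
            (match_weight(comparison_vector(old, new_rec, fields), m_probs, u_probs), i)
            for i, old in enumerate(existing)
        ]
        best[j] = max(scores)
    return best


# Toy data: the second file arrives later and contains a misspelled last name.
file_1 = [
    {"first_name": "ana", "last_name": "lopez", "birth_year": 1980},
    {"first_name": "ben", "last_name": "khan", "birth_year": 1975},
]
file_2 = [{"first_name": "ana", "last_name": "lopes", "birth_year": 1980}]

best = score_new_file(file_1, file_2, FIELDS, M_PROBS, U_PROBS)
```

Here the new record is paired with the first record of `file_1` with a positive weight, since two of the three fields agree. A Bayesian treatment such as the one described above would instead place priors on the m and u parameters and on the link structure and draw posterior samples, rather than fixing the probabilities and taking a single best pair.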

Created:
2023-11-15