Record Linkage Datasets
Source: Figshare. Updated: 2022-04-02. Indexed: 2026-04-08.
Download link:
https://figshare.com/articles/dataset/Record_Linkage_Datasets/19500671/1
Description:
This simulated dataset is a corrupted segment of the Social Security Death Master File (SSDMF), available at https://ssdmf.info/. There are 11 original datasets named `dsxo`, where `x` runs from 1 to 11 and the suffix `o` stands for "original". Their sizes (number of original records) are as follows:

| dataset | size |
|:-------:|:----:|
| ds1o | 10K |
| ds2o | 20K |
| ds3o | 40K |
| ds4o | 80K |
| ds5o | 120K |
| ds6o | 160K |
| ds7o | 200K |
| ds8o | 400K |
| ds9o | 600K |
| ds10o | 800K |
| ds11o | 1M |

The original records are then corrupted with a modified version of the `dsgen` Python script by Peter Christen. The corrupted files are saved as `dsxm`, where the suffix `m` stands for "modified". The modified records plus four replicates of the original records are concatenated and shuffled with the Linux command-line tool `shuf`. The resulting datasets are named `dsx.0` before shuffling and `dsx.1` after shuffling. Each one therefore holds five times as many records as its original dataset:

| dataset | size |
|:-------:|:----:|
| ds1.1 | 50K |
| ds2.1 | 100K |
| ds3.1 | 200K |
| ds4.1 | 400K |
| ds5.1 | 600K |
| ds6.1 | 800K |
| ds7.1 | 1M |
| ds8.1 | 2M |
| ds9.1 | 3M |
| ds10.1 | 4M |
| ds11.1 | 5M |

Furthermore, each dataset is split into two halves to serve as input for record linkage algorithms. For example, `ds1.1` is split into `ds1.1.1` and `ds1.1.2`.
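The construction described above (modified records plus four original replicates, shuffled, then split into two halves) can be sketched in Python. This is only an illustration under stated assumptions, not the author's actual scripts: the function name `build_linkage_inputs`, the in-memory lists, the fixed seed, and the split at the midpoint are hypothetical choices. The source used `dsgen` for corruption and `shuf` for shuffling, and does not say how the halves were cut.

```python
import random


def build_linkage_inputs(original, modified, replicates=4, seed=0):
    """Mimic the dsx.0 -> dsx.1 -> dsx.1.1 / dsx.1.2 steps on in-memory records.

    All names and defaults here are illustrative assumptions.
    """
    # Analogue of dsx.0: modified records followed by `replicates` copies
    # of the original records (unshuffled).
    combined = list(modified) + list(original) * replicates

    # Analogue of dsx.1: a shuffled copy (the source used the `shuf` tool).
    shuffled = combined[:]
    random.Random(seed).shuffle(shuffled)

    # Analogue of dsx.1.1 and dsx.1.2: split at the midpoint (assumed).
    mid = len(shuffled) // 2
    return combined, shuffled, shuffled[:mid], shuffled[mid:]


# Toy example with 10 records standing in for the 10K records of ds1o.
original = [f"rec{i}" for i in range(10)]
modified = [f"rec{i}*" for i in range(10)]  # stand-in for corrupted records
ds_0, ds_1, half_1, half_2 = build_linkage_inputs(original, modified)
print(len(ds_1), len(half_1), len(half_2))  # 50 25 25
```

With 10 original records the combined set has 10 + 4 × 10 = 50 records, which matches the 10K to 50K ratio between `ds1o` and `ds1.1` in the tables.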
Provided by:
Soliman, Ahmed
Created:
2022-04-02



