
Record Linkage Datasets

Source: Figshare. Updated: 2022-04-02. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Record_Linkage_Datasets/19500671
Description:
This simulated dataset is a corrupted segment of the Social Security Death Master File (SSDMF), available at https://ssdmf.info/. There are 11 original datasets named `dsxo`, where `x` runs from 1 to 11 and the suffix `o` stands for `original`. The sizes (number of original records) of these datasets are as follows:

| dataset | size |
|:----------:|:----:|
| ds1o | 10K |
| ds2o | 20K |
| ds3o | 40K |
| ds4o | 80K |
| ds5o | 120K |
| ds6o | 160K |
| ds7o | 200K |
| ds8o | 400K |
| ds9o | 600K |
| ds10o | 800K |
| ds11o | 1M |

These original records are then corrupted with a modified version of the `dsgen` Python script by Peter Christen. The corrupted files are saved as `dsxm`, where the suffix `m` stands for `modified`.

The modified records plus four replicates of the original records are concatenated and then shuffled with the Linux command-line tool `shuf`. The resulting datasets are named `dsx.0` before shuffling and `dsx.1` after shuffling. Each is therefore five times the size of its original dataset:

| dataset | size |
|:-------:|:----:|
| ds1.1 | 50K |
| ds2.1 | 100K |
| ds3.1 | 200K |
| ds4.1 | 400K |
| ds5.1 | 600K |
| ds6.1 | 800K |
| ds7.1 | 1M |
| ds8.1 | 2M |
| ds9.1 | 3M |
| ds10.1 | 4M |
| ds11.1 | 5M |

Furthermore, each dataset is split into two halves to serve as input for record linkage algorithms. For example, ds1.1 is split into ds1.1.1 and ds1.1.2.
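The concatenate, shuffle, and split steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used `shuf` and a modified `dsgen` script. The function names `build_mixed` and `split_halves`, the toy records, the fixed seed, and the choice to split at the midpoint of the shuffled list are all assumptions made for illustration.

```python
import random


def build_mixed(original, modified, replicates=4, seed=0):
    """Return (before_shuffle, after_shuffle) lists of records.

    Hypothetical helper: concatenates the modified records with
    `replicates` copies of the original records, then shuffles a copy.
    """
    combined = list(modified)
    for _ in range(replicates):
        combined.extend(original)
    shuffled = combined[:]                  # keep the unshuffled list intact
    random.Random(seed).shuffle(shuffled)   # stand-in for the `shuf` tool
    return combined, shuffled


def split_halves(records):
    """Split a list into two halves (assumed: a cut at the midpoint)."""
    mid = len(records) // 2
    return records[:mid], records[mid:]


# Toy stand-ins for ds1o and ds1m (10 records instead of 10K).
original = [f"rec{i}" for i in range(10)]
modified = [f"rec{i}~" for i in range(10)]

ds_0, ds_1 = build_mixed(original, modified)   # analogous to dsx.0 and dsx.1
part_1, part_2 = split_halves(ds_1)            # analogous to dsx.1.1 and dsx.1.2

print(len(ds_0), len(part_1), len(part_2))
```

With one modified copy and four original replicates, the combined list holds five times as many records as the original, which matches the ratio between the two size tables (for example, 10K records in ds1o and 50K in ds1.1).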
Created:
2022-04-02