SPIDER - Synthetic Person Information Dataset for Entity Resolution
Source: Figshare | Updated: 2025-07-24 | Indexed: 2026-04-08
Download link:
https://figshare.com/articles/dataset/SPIDER_-_Synthetic_Person_Information_Dataset_for_Entity_Resolution/29595599/2
Description:
SPIDER (Synthetic Person Information Dataset for Entity Resolution) provides researchers with ready-to-use data for benchmarking duplicate detection and entity resolution algorithms. The dataset targets person-level fields that are typical in customer data. Because real-world person-level data contains Personally Identifiable Information (PII) and is therefore hard to source, very few synthetic datasets are publicly available, and the existing ones are limited by small volume and missing core person-level fields. SPIDER addresses these challenges by focusing on core person-level attributes: first/last name, email, phone, address and dob. Using the Python Faker library, 40,000 unique synthetic person records are created. An additional 10,000 duplicate records are generated from the base records using 7 real-world transformation rules. Each duplicate record is labelled with its original base record and the rule used to generate it, through the is_duplicate_of and duplication_rule fields.

Duplicate Rules:
1. Duplicate record with a variation in email address
2. Duplicate record with a variation in email address
3. Duplicate record with last name variation
4. Duplicate record with first name variation
5. Duplicate record with a nickname
6. Duplicate record with near exact spelling
7. Duplicate record with only same email and name

Output Format:
The dataset is provided in both JSON and CSV formats for use in data processing and machine learning tools.

Data Regeneration:
The project includes the Python script used to generate the 50,000 person records. The script can be extended with additional duplicate rules, fuzzy name variations, geographical name variations, and volume adjustments.

Files Included:
spider_dataset_20250714_035016.csv
spider_dataset_20250714_035016.json
spider_readme.md
DataDescriptions
pythoncodeV1.py
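Example (illustrative sketch, not part of the dataset files):
The labels described above can serve as ground truth when scoring a matcher. The following is a minimal sketch that assumes a CSV layout: only is_duplicate_of and duplication_rule are named in the description, while the record_id, first_name, last_name and email column names, the rule label strings, and the sample rows are hypothetical stand-ins. It uses a deliberately naive "same email" matcher to show how precision and recall could be computed against the labelled pairs.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for the real CSV file. Only is_duplicate_of and
# duplication_rule come from the dataset description; every other column
# name, the rule labels, and the rows themselves are hypothetical.
SAMPLE_CSV = """record_id,first_name,last_name,email,is_duplicate_of,duplication_rule
1,Robert,Smith,robert.smith@example.com,,
2,Ana,Lopez,ana.lopez@example.com,,
3,Bob,Smith,robert.smith@example.com,1,nickname
4,Ana,Lopes,ana.lopez@example.com,2,last_name_variation
5,Ana,Lopez,a.lopez@example.com,2,email_variation
"""


def load_records(handle):
    """Read every row of a CSV file handle into a list of dicts."""
    return list(csv.DictReader(handle))


def true_pairs(records, id_field="record_id"):
    """Return the labelled (base_id, duplicate_id) pairs as a set."""
    return {
        (row["is_duplicate_of"], row[id_field])
        for row in records
        if row["is_duplicate_of"]
    }


def rule_counts(records):
    """Count labelled duplicates per duplication rule."""
    return Counter(
        row["duplication_rule"] for row in records if row["is_duplicate_of"]
    )


def precision_recall(predicted, truth):
    """Score a set of predicted pairs against the labelled pairs."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


records = load_records(io.StringIO(SAMPLE_CSV))
truth = true_pairs(records)

# Naive baseline matcher: two records match when their emails are identical.
# The first record seen with a given email is treated as the base record.
first_seen = {}
predicted = set()
for row in records:
    base_id = first_seen.setdefault(row["email"], row["record_id"])
    if base_id != row["record_id"]:
        predicted.add((base_id, row["record_id"]))

precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(rule_counts(records))
```

On this toy sample the email-only matcher misses the duplicate whose email was altered, so recall falls below 1 while precision stays at 1. To try the same scoring on the published file, io.StringIO(SAMPLE_CSV) could be replaced with an open handle on spider_dataset_20250714_035016.csv, after adjusting the hypothetical column names to the real header.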
Provided by:
Arokiya Dass, Rose Mary; mathur, yash; Chinnappa, Praveen
Created:
2025-07-24



