SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution
Source: DataCite Commons | Updated: 2025-12-31 | Indexed: 2026-04-25
Download link:
https://figshare.com/articles/dataset/SPIDER_v2_Synthetic_Person_Information_Dataset_for_Entity_Resolution/30472712/1
Description:
SPIDER (v2), the Synthetic Person Information Dataset for Entity Resolution, provides researchers with ready-to-use data for benchmarking duplicate detection and entity resolution algorithms. The dataset focuses on person-level fields typical of customer or citizen records. Access to real-world person-level data is restricted by Personally Identifiable Information (PII) constraints, and the publicly available synthetic datasets are limited in scope, volume, or realism.

SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and rule through the fields is_duplicate_of and duplication_rule.

Version 2 introduces major realism and structural improvements, enhancing both the dataset and the generation framework.

Enhancements in Version 2:
- New cluster_id column that groups base and duplicate records for improved entity-level benchmarking.
- Improved data realism with consistent field relationships:
  - State and ZIP codes now match correctly.
  - Phone numbers are generated based on state codes.
  - Email addresses are logically related to name components.
- Refined duplication logic:
  - Rule 4 updated for realistic address variation.
  - Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
- Improved data validation and formatting for address, email, and date fields.
- Updated Python generation script for modular configuration, reproducibility, and extensibility.

Duplicate Rules (with real-world use cases):
1. Duplicate record with a variation in email address. Use case: same person using multiple email accounts.
2. Duplicate record with a variation in phone numbers. Use case: same person using multiple contact numbers.
3. Duplicate record with last-name variation. Use case: name changes or data entry inconsistencies.
4. Duplicate record with address variation. Use case: same person maintaining multiple addresses or moving residences.
5. Duplicate record with a nickname. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6. Duplicate record with minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7. Duplicate record with multiple individuals sharing the same email and last name but different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

Output Format:
The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

Data Regeneration:
The included Python script can be used to fully regenerate the dataset and supports:
- Addition of new duplication rules
- Regional, linguistic, or domain-specific variations
- Volume scaling for large-scale testing scenarios

Files Included:
- spider_dataset_v2_6_20251027_022215.csv
- spider_dataset_v2_6_20251027_022215.json
- spider_readme_v2.md
- SPIDER_generation_script_v2.py
- SupportingDocuments/ folder containing:
  - benchmark_comparison_script.py: script used to derive the F1 score.
  - Public_census_data_surname.csv: sample U.S. Census name and demographic data used for comparison.
  - ssa_firstnames.csv: Social Security Administration names dataset.
  - simplemaps_uszips.csv: ZIP-to-state mapping data used for phone and address validation.
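
Usage sketch:
The ground-truth fields named in the description (cluster_id, is_duplicate_of, duplication_rule) lend themselves to pairwise scoring of a matcher. The following is a minimal Python sketch of that idea, not the method used by the bundled benchmark_comparison_script.py. It runs on a handful of toy rows rather than the published file. The record_id and email column names, the cluster labels, and the encoding of duplication_rule as rule numbers are all assumptions made for illustration; the real headers should be checked first. It also assumes that all records sharing a cluster_id count as one entity, which may need revisiting for Rule 7 records, where the individuals are distinct.

```python
from collections import defaultdict
from itertools import combinations

# Toy rows standing in for the published CSV. Only cluster_id, is_duplicate_of
# and duplication_rule are named in the description; record_id, email and the
# value encodings used here are assumptions for illustration.
COLUMNS = ("record_id", "email", "cluster_id", "is_duplicate_of", "duplication_rule")
ROWS = [dict(zip(COLUMNS, values)) for values in [
    ("1", "robert.smith@example.com", "c1", "", ""),
    ("2", "robert.smith@example.com", "c1", "1", "5"),  # nickname variant
    ("3", "sara.lee@example.com", "c2", "", ""),
    ("4", "sara.lee@example.com", "c2", "3", "6"),      # first-name spelling variant
    ("5", "liam.jones@example.com", "c3", "", ""),
    ("6", "ljones@example.org", "c3", "5", "1"),        # email variant
]]


def pairs_within_groups(rows, key):
    """Return every unordered pair of record IDs whose rows share the same `key` value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["record_id"])
    return {frozenset(pair) for ids in groups.values() for pair in combinations(ids, 2)}


def pairwise_scores(predicted, truth):
    """Compute pairwise precision, recall and F1 for two sets of record-ID pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = pairs_within_groups(ROWS, "cluster_id")  # ground truth from cluster labels
predicted = pairs_within_groups(ROWS, "email")   # naive matcher: exact email equality
precision, recall, f1 = pairwise_scores(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On the toy rows this prints precision=1.00 recall=0.67 f1=0.80: exact email matching finds the nickname and spelling duplicates but misses the email-variation duplicate. To try the same scoring on the published data, the rows could be loaded with csv.DictReader from spider_dataset_v2_6_20251027_022215.csv and the assumed column names replaced with the actual ones.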
Provider:
figshare
Created:
2025-10-29



