SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution

Name: SPIDER (v2): Synthetic Person Information Dataset for Entity Resolution
Creator: figshare
Published: 2025-12-31 06:56:47
License: No description available

Source: DataCite Commons. Updated: 2025-12-31. Indexed: 2026-04-25.

Download link:

https://figshare.com/articles/dataset/SPIDER_v2_Synthetic_Person_Information_Dataset_for_Entity_Resolution/30472712/1


Resource description:

SPIDER (v2), the Synthetic Person Information Dataset for Entity Resolution, provides researchers with ready-to-use data for benchmarking duplicate-detection and entity-resolution algorithms. The dataset focuses on person-level fields typical of customer or citizen records. Real-world person-level data is restricted by Personally Identifiable Information (PII) constraints, and the publicly available synthetic datasets are limited in scope, volume, or realism.

SPIDER addresses these limitations by providing a large-scale, realistic dataset containing first name, last name, email, phone, address, and date of birth (DOB) attributes. Using the Python Faker library, 40,000 unique synthetic person records were generated, followed by 10,000 controlled duplicate records derived using seven real-world transformation rules. Each duplicate record is linked to its original base record and to the rule that produced it through the fields <code>is_duplicate_of</code> and <code>duplication_rule</code>.

Version 2 introduces major realism and structural improvements to both the dataset and the generation framework.

Enhancements in Version 2

- New <code>cluster_id</code> column that groups base and duplicate records for improved entity-level benchmarking.
- Improved data realism with consistent field relationships:
  - State and ZIP codes now match correctly.
  - Phone numbers are generated based on state codes.
  - Email addresses are logically related to name components.
- Refined duplication logic:
  - Rule 4 updated for realistic address variation.
  - Rule 7 enhanced to simulate shared accounts among different individuals (with distinct DOBs).
- Improved data validation and formatting for address, email, and date fields.
- Updated Python generation script for modular configuration, reproducibility, and extensibility.

Duplicate Rules (with real-world use cases)

1. Duplicate record with a variation in email address. Use case: same person using multiple email accounts.
2. Duplicate record with a variation in phone number. Use case: same person using multiple contact numbers.
3. Duplicate record with a last-name variation. Use case: name changes or data entry inconsistencies.
4. Duplicate record with an address variation. Use case: same person maintaining multiple addresses or moving residences.
5. Duplicate record with a nickname. Use case: same person using formal and informal names (Robert → Bob, Elizabeth → Liz).
6. Duplicate record with minor spelling variations in the first name. Use case: legitimate entry or migration errors (Sara → Sarah).
7. Duplicate record in which multiple individuals share the same email and last name but have different DOBs. Use case: realistic shared accounts among family members or households (benefits, tax, or insurance portals).

Output Format

The dataset is available in both CSV and JSON formats for direct use in data-processing, machine-learning, and record-linkage frameworks.

Data Regeneration

The included Python script can be used to fully regenerate the dataset and supports:

- Addition of new duplication rules
- Regional, linguistic, or domain-specific variations
- Volume scaling for large-scale testing scenarios

Files Included

- <code>spider_dataset_v2_6_20251027_022215.csv</code>
- <code>spider_dataset_v2_6_20251027_022215.json</code>
- <code>spider_readme_v2.md</code>
- <code>SPIDER_generation_script_v2.py</code>
- <code>SupportingDocuments/</code> folder containing:
  - <code>benchmark_comparison_script.py</code>: script used to derive the F1 score.
  - <code>Public_census_data_surname.csv</code>: sample U.S. Census name and demographic data used for comparison.
  - <code>ssa_firstnames.csv</code>: Social Security Administration names dataset.
  - <code>simplemaps_uszips.csv</code>: ZIP-to-state mapping data used for phone and address validation.
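
As a rough illustration of how the cluster_id column could support entity-level benchmarking, the sketch below scores a deliberately naive matcher (exact email equality) against the ground-truth clusters using pairwise precision, recall, and F1. It is not the bundled benchmark_comparison_script.py, whose contents are not shown in this listing. The field names cluster_id, is_duplicate_of, and duplication_rule come from the description; the record_id and email column names, the sample rows, and the helper functions are assumptions made for illustration only.

```python
from itertools import combinations

# Hand-made sample rows, NOT taken from the published files.
# "record_id" and "email" are assumed column names; check the CSV header
# and spider_readme_v2.md for the real ones.
records = [
    {"record_id": "1", "email": "rob@example.com", "cluster_id": "c1",
     "is_duplicate_of": "", "duplication_rule": ""},
    {"record_id": "2", "email": "rob@example.com", "cluster_id": "c1",
     "is_duplicate_of": "1", "duplication_rule": "5"},
    {"record_id": "3", "email": "liz@example.com", "cluster_id": "c2",
     "is_duplicate_of": "", "duplication_rule": ""},
    {"record_id": "4", "email": "liz.alt@example.com", "cluster_id": "c2",
     "is_duplicate_of": "3", "duplication_rule": "1"},
    # A different entity that happens to share an email with record 3;
    # the naive matcher will wrongly link it.
    {"record_id": "5", "email": "liz@example.com", "cluster_id": "c3",
     "is_duplicate_of": "", "duplication_rule": ""},
]


def linked_pairs(rows, key):
    """Return all unordered record-id pairs whose rows share the same key value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row["record_id"])
    pairs = set()
    for ids in groups.values():
        pairs.update(frozenset(p) for p in combinations(ids, 2))
    return pairs


def pairwise_scores(rows, predicted_key, truth_key="cluster_id"):
    """Pairwise precision, recall, and F1 of a grouping against the truth grouping."""
    predicted = linked_pairs(rows, predicted_key)
    truth = linked_pairs(rows, truth_key)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


precision, recall, f1 = pairwise_scores(records, predicted_key="email")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

To try the same scoring on the real data, the rows could be read with csv.DictReader from the CSV file listed above, after replacing the assumed column names with the actual ones. Grouping the results by duplication_rule would then show which of the seven rules a given matcher handles poorly.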

Provider:

figshare

Created:

2025-10-29
