SPIDER - Synthetic Person Information Dataset for Entity Resolution
Source: DataCite Commons. Updated 2025-07-24; indexed 2025-09-08.
Download link:
https://figshare.com/articles/dataset/SPIDER_-_Synthetic_Person_Information_Dataset_for_Entity_Resolution/29595599/2
Resource description:
SPIDER (Synthetic Person Information Dataset for Entity Resolution) offers researchers ready-to-use data for benchmarking duplicate detection and entity resolution algorithms. The dataset targets the person-level fields typical of customer data. Real-world person-level data is hard to source because it contains Personally Identifiable Information (PII), and few synthetic datasets are publicly available. Those that exist tend to be small and to lack core person-level fields. SPIDER addresses these gaps by focusing on core person-level attributes: first/last name, email, phone, address, and date of birth (dob). It contains 40,000 unique synthetic person records created with the Python Faker library, plus 10,000 duplicate records generated from the base records using 7 real-world transformation rules. Each duplicate record is labelled through two fields: `is_duplicate_of` identifies the original base record, and `duplication_rule` records the rule used to generate the duplicate.
<b>Duplicate Rules</b>
1. Duplicate record with a variation in email address
2. Duplicate record with a variation in email address
3. Duplicate record with last name variation
4. Duplicate record with first name variation
5. Duplicate record with a nickname
6. Duplicate record with near exact spelling
7. Duplicate record with only the same email and name
<b>Output Format</b>
The dataset is provided in both JSON and CSV formats for use in data processing and machine learning tools.
<b>Data Regeneration</b>
The project includes the Python script used to generate the 50,000 person records. The script can be extended with additional duplicate rules, fuzzy name variations, geographical name variations, and volume adjustments.
<b>Files Included</b>
spider_dataset_20250714_035016.csv
spider_dataset_20250714_035016.json
spider_readme.md
DataDescriptions
pythoncodeV1.py
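As an illustration of how the `is_duplicate_of` and `duplication_rule` labels might be used for benchmarking, the following is a minimal sketch rather than the dataset authors' own evaluation code. It scores a deliberately naive matcher (exact email match) against the labelled duplicate pairs and reports recall per rule. Only the two label fields are named in the description; the other column names (`record_id`, `first_name`, `last_name`, `email`), the rule label strings, and the inline sample rows are assumptions made for the example, so check them against the actual CSV header before reuse.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Tiny inline stand-in for the real CSV file. Only `is_duplicate_of` and
# `duplication_rule` come from the dataset description; every other column
# name and every rule label below is hypothetical.
SAMPLE_CSV = """record_id,first_name,last_name,email,is_duplicate_of,duplication_rule
1,Robert,Smith,rsmith@example.com,,
2,Alice,Jones,alice.jones@example.com,,
3,Bob,Smith,rsmith@example.com,1,nickname
4,Alice,Jonas,a.jones@example.org,2,last_name_variation
"""


def load_records(handle):
    """Read all rows as dictionaries keyed by the CSV header."""
    return list(csv.DictReader(handle))


def true_pairs(records):
    """Ground-truth duplicate pairs, taken from the is_duplicate_of label."""
    return {
        frozenset((row["is_duplicate_of"], row["record_id"]))
        for row in records
        if row["is_duplicate_of"]
    }


def predicted_pairs_by_email(records):
    """Naive baseline: link every pair of records sharing an email address."""
    by_email = {}
    for row in records:
        by_email.setdefault(row["email"].lower(), []).append(row["record_id"])
    return {
        frozenset(pair)
        for ids in by_email.values()
        for pair in combinations(ids, 2)
    }


def precision_recall(predicted, truth):
    """Pairwise precision and recall of predicted links against the labels."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


def recall_by_rule(records, predicted):
    """Share of labelled duplicates recovered, grouped by duplication_rule."""
    found, total = Counter(), Counter()
    for row in records:
        if row["is_duplicate_of"]:
            rule = row["duplication_rule"]
            total[rule] += 1
            if frozenset((row["is_duplicate_of"], row["record_id"])) in predicted:
                found[rule] += 1
    return {rule: found[rule] / total[rule] for rule in total}


records = load_records(io.StringIO(SAMPLE_CSV))
truth = true_pairs(records)
predicted = predicted_pairs_by_email(records)
precision, recall = precision_recall(predicted, truth)
per_rule = recall_by_rule(records, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(per_rule)
```

To run the same evaluation on the published file, replace `io.StringIO(SAMPLE_CSV)` with an open handle to the CSV, for example `open("spider_dataset_20250714_035016.csv", newline="", encoding="utf-8")`, after adjusting the assumed column names. Generating all candidate pairs within a block is quadratic in block size, which is acceptable for exact-email blocks but would need a proper blocking strategy for looser matchers.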
Provider:
figshare
Created:
2025-07-24



