Synthetic Dataset for PII Detection and Anonymization in Financial Documents
Source: doi.org. Indexed 2025-03-22.
Download link:
http://doi.org/10.17632/tzrjx692jy.1
Resource description:
This dataset is designed for training and testing machine learning models that detect and anonymize Personally Identifiable Information (PII) in financial documents. It is fully synthetic and contains no real-world personal data, in keeping with data privacy requirements. The dataset simulates the PII entities typically found in financial contexts and includes training and testing sets of synthetic entries generated using realistic financial document structures.
Each entry simulates a real-world financial text such as an auditor report, tax filing, compliance notice, or transaction confirmation. A variety of synthetic PII types are embedded in these documents, including:
• Names
• Social Security Numbers (SSNs)
• Credit Card Numbers
• Phone Numbers
• Email Addresses
• Physical Addresses
• Company Names
• URLs
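As a rough illustration of what detecting two of these entity types can look like, the sketch below uses regular expressions as a naive baseline. The patterns, the `detect` function, and the sample sentence are assumptions made for this example and are not part of the dataset; a trained model would normally replace them.

```python
import re

# Hypothetical, deliberately simple patterns; real data needs broader coverage.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # hyphenated form only
}


def detect(text):
    """Return (start, end, entity_type) tuples sorted by start offset."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


# Made-up sentence in the style of a compliance notice.
sample = "Contact jane.doe@example.com regarding SSN 123-45-6789."
for start, end, label in detect(sample):
    print(label, sample[start:end])
```

Names, addresses, and company names generally cannot be captured by patterns like these, which is one reason annotated training data is useful.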
Dataset Structure:
The training and testing datasets share the following structure:
Columns:
1. Name: Contains the synthetic full names of individuals, generated with a mix of genders and cultural backgrounds.
2. Credit Card: Lists synthetic credit card information, including card numbers, expiration dates, and security codes. Various credit card types (e.g., VISA, MasterCard, American Express) are represented.
3. Email: Includes synthetic email addresses in realistic formats with diverse domain names.
4. URL: Contains synthetic website URLs from various domains (e.g., .com, .org, .info), mimicking the variety found in real financial documents.
5. Phone: Represents synthetic phone numbers in different formats, including international formats.
6. Address: Consists of detailed synthetic addresses, including street names, cities, states, and postal codes, generated in various formats.
7. Company: Includes synthetic company names across different industries, providing a realistic mix of common and unique names.
8. SSN: Contains synthetic Social Security Numbers (SSNs) in various formats (e.g., with or without hyphens), including region-specific patterns.
9. Text: The main body of text simulating financial document content such as audits, reports, invoices, compliance notices, or transaction confirmations. Each text entry contains embedded PII data.
10. True Predictions: Lists the exact start and end character positions of each PII entity within the "Text" column, along with the entity type (e.g., 'name', 'email', 'address').
Please note: This dataset does not contain any real-world sensitive information.
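A minimal sketch of how character-offset annotations of this kind can drive anonymization and scoring. The description does not specify how "True Predictions" is serialized, so the `(start, end, entity_type)` tuples with an exclusive end offset, the helper names, and the sample text are all assumptions for illustration.

```python
from typing import List, Tuple

# Assumed annotation shape: (start, end, entity_type), end exclusive.
Span = Tuple[int, int, str]


def anonymize(text: str, spans: List[Span]) -> str:
    """Replace each annotated span with a placeholder such as [EMAIL]."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + f"[{label.upper()}]" + out[end:]
    return out


def span_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall over (start, end, type) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


def span_of(text: str, value: str, label: str) -> Span:
    """Helper for this example: locate a value and build its span."""
    start = text.index(value)
    return (start, start + len(value), label)


text = "Invoice issued to Jane Doe (jane.doe@example.com)."
gold = [
    span_of(text, "Jane Doe", "name"),
    span_of(text, "jane.doe@example.com", "email"),
]

print(anonymize(text, gold))         # Invoice issued to [NAME] ([EMAIL]).
print(span_scores(gold, gold[:1]))   # (1.0, 0.5)
```

Exact-match scoring is strict: a prediction that is off by one character counts as a miss, so overlap-based scoring may be preferable depending on the use case.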
Provider:
Mendeley Data



