gretel-pii-masking-en-v1
Source: ModelScope community. Updated 2025-12-05; indexed 2025-05-24.
Download link:
https://modelscope.cn/datasets/gretelai/gretel-pii-masking-en-v1
<center>
<img src="pii_masking_en-v1.png" width=600>
</center>
# Gretel Synthetic Domain-Specific Documents Dataset (English)
This dataset is a synthetically generated collection of documents enriched with Personally Identifiable Information (PII) and Protected Health Information (PHI) entities spanning multiple domains.
Created using Gretel Navigator with mistral-nemo-2407 as the backend model, it is specifically designed for fine-tuning GLiNER models.
The dataset contains document passages featuring PII/PHI entities from a wide range of domains and document types, making it well suited to tasks such as Named Entity Recognition (NER), text classification, and domain-specific document analysis.
## Key Features
This dataset is designed to provide a comprehensive resource for developing and fine-tuning models in tasks involving sensitive information across various domains.
- **Synthetic Data Generation**: The dataset is generated entirely with Gretel Navigator, yielding realistic and varied samples that have undergone automated validation for quality and consistency.
- **Entity Extraction for PII/PHI**: Documents contain a wide range of PII and PHI entities, including names, dates, and unique identifiers, categorized by type. This provides a valuable foundation for training models on tasks like NER, PII detection, and sensitive data redaction.
- **Diverse Real-World Contexts**: The dataset covers multiple industries such as finance, healthcare, cybersecurity, and others, providing broad coverage across different document types and enhancing model generalization.
- **Document Descriptions**: Each document includes a description that outlines its structure and typical contents, aiding in document classification and understanding.
## Dataset Column Descriptions
The dataset includes several key columns, each providing vital information for understanding and utilizing the documents effectively in various AI and NLP tasks:
- **uid**: A unique identifier for each document, ensuring traceability and easy reference.
- **domain**: The industry or sector to which the document belongs (e.g., healthcare, finance, technology); see Domain Distribution below.
- **document_type**: Specifies the category or format of the document within a given domain (e.g., invoices, contracts, medical records).
- **document_description**: Provides a detailed overview of the document’s structure, typical fields, and its intended purpose, offering additional context for document classification tasks.
- **text**: The core text content of the document, serving as a rich data source for various NLP tasks such as text classification, NER, and more.
- **entities**: A list of extracted PII/PHI entities within the document. Each entity is annotated with its type (e.g., name, address, date of birth), facilitating tasks like entity recognition and sensitive information extraction.
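The columns above can be combined for a simple redaction pass. The following is a minimal sketch, not the dataset's official tooling: the record is invented for illustration, and the `entity`/`types` keys and the string serialization of the `entities` column are assumptions that should be checked against the actual data.

```python
import ast

# Hypothetical record mirroring the columns described above; all values are invented.
record = {
    "uid": "example-0001",
    "domain": "healthcare",
    "document_type": "medical record",
    "text": "Patient Jane Doe (MRN-12345) was born on 1980-02-14.",
    # Assumed serialization: a stringified list of {"entity", "types"} dicts.
    "entities": "[{'entity': 'Jane Doe', 'types': ['name']}, "
                "{'entity': 'MRN-12345', 'types': ['medical_record_number']}, "
                "{'entity': '1980-02-14', 'types': ['date_of_birth']}]",
}

def parse_entities(raw):
    """Accept either a list of dicts or its string form."""
    return ast.literal_eval(raw) if isinstance(raw, str) else list(raw)

def mask_text(text, entities):
    """Replace every entity string with a [TYPE] placeholder."""
    # Longest first so a shorter entity cannot clobber part of a longer one.
    for ent in sorted(entities, key=lambda e: len(e["entity"]), reverse=True):
        placeholder = "[" + ent["types"][0].upper() + "]"
        text = text.replace(ent["entity"], placeholder)
    return text

masked = mask_text(record["text"], parse_entities(record["entities"]))
print(masked)
```

Plain string replacement keeps the sketch short; a production redactor would more likely work from character offsets to avoid masking unrelated occurrences of the same string.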
## Dataset Statistics and Distribution
This dataset is split into training (50k records), validation (5k), and test (5k) sets. The tables below break each split down by domain and by entity type.
### Domain Distribution
The dataset contains documents from a wide range of domains, making it suitable for various industrial applications and research purposes.
| Domain | Train | Validation | Test |
| --- | --- | --- | --- |
| aerospace-defense | 1067 | 108 | 106 |
| agriculture-food-services | 1121 | 114 | 122 |
| authentication-services | 939 | 99 | 88 |
| automotive | 1112 | 103 | 97 |
| aviation | 1062 | 99 | 81 |
| banking | 1013 | 103 | 118 |
| biometrics | 1281 | 103 | 113 |
| blockchain | 1034 | 101 | 105 |
| cloud-services | 1065 | 120 | 118 |
| code-review | 659 | 56 | 66 |
| compliance-regulation | 1249 | 148 | 147 |
| cryptography | 1197 | 119 | 104 |
| cybersecurity | 927 | 104 | 103 |
| data-privacy | 1395 | 144 | 138 |
| defense-security | 1077 | 95 | 103 |
| digital-certificates | 1072 | 103 | 97 |
| digital-payments | 947 | 102 | 112 |
| e-commerce | 730 | 85 | 65 |
| education | 972 | 104 | 95 |
| energy-utilities | 1024 | 113 | 115 |
| finance | 1001 | 102 | 94 |
| financial-services | 1027 | 94 | 117 |
| government | 1224 | 124 | 111 |
| healthcare | 1207 | 108 | 133 |
| healthcare-administration | 1194 | 131 | 118 |
| human-resources | 933 | 80 | 79 |
| identity-verification | 1298 | 118 | 121 |
| information-technology | 808 | 73 | 87 |
| insurance | 1087 | 106 | 116 |
| internet-services | 1074 | 119 | 115 |
| legal-documents | 790 | 73 | 68 |
| logistics-transportation | 1291 | 147 | 130 |
| manufacturing | 1283 | 124 | 125 |
| marine | 1049 | 104 | 99 |
| media-entertainment | 864 | 93 | 81 |
| networking | 1097 | 109 | 92 |
| non-profit-charity | 920 | 86 | 85 |
| pharmaceuticals-biotechnology | 1273 | 133 | 152 |
| public-sector | 1234 | 124 | 119 |
| real-estate | 882 | 100 | 91 |
| retail-consumer-goods | 977 | 96 | 99 |
| security | 1155 | 119 | 111 |
| supply-chain | 1206 | 113 | 125 |
| technology-software | 917 | 93 | 79 |
| telecommunications | 1005 | 105 | 123 |
| transportation | 1286 | 143 | 134 |
| travel-hospitality | 975 | 60 | 103 |
### Entity Type Distribution
The dataset includes a broad variety of entity types, focusing heavily on PII and PHI to support privacy-enhancing model development.
| Entity Type | Train | Validation | Test |
| --- | --- | --- | --- |
| medical_record_number | 26031 | 2589 | 2658 |
| date_of_birth | 23684 | 2345 | 2331 |
| ssn | 16877 | 1734 | 1661 |
| date | 11740 | 1170 | 1157 |
| first_name | 11421 | 1098 | 1172 |
| email | 10891 | 1104 | 1049 |
| last_name | 10804 | 1040 | 1057 |
| customer_id | 10023 | 1025 | 1033 |
| employee_id | 9945 | 988 | 1005 |
| name | 9688 | 1015 | 980 |
| street_address | 8576 | 939 | 869 |
| phone_number | 8537 | 884 | 904 |
| ipv4 | 8235 | 817 | 896 |
| credit_card_number | 6469 | 634 | 663 |
| license_plate | 6000 | 613 | 579 |
| address | 5563 | 551 | 563 |
| user_name | 3252 | 305 | 338 |
| device_identifier | 2253 | 213 | 249 |
| bank_routing_number | 2158 | 210 | 257 |
| date_time | 2043 | 206 | 211 |
| company_name | 1950 | 177 | 185 |
| unique_identifier | 1842 | 189 | 189 |
| biometric_identifier | 1552 | 129 | 137 |
| account_number | 1328 | 134 | 141 |
| city | 1321 | 138 | 128 |
| certificate_license_number | 1307 | 133 | 124 |
| time | 1104 | 112 | 106 |
| postcode | 955 | 93 | 104 |
| vehicle_identifier | 941 | 101 | 98 |
| coordinate | 814 | 62 | 85 |
| country | 767 | 78 | 71 |
| api_key | 731 | 83 | 60 |
| ipv6 | 622 | 61 | 66 |
| password | 569 | 64 | 59 |
| health_plan_beneficiary_number | 446 | 48 | 41 |
| national_id | 425 | 44 | 46 |
| tax_id | 303 | 31 | 23 |
| url | 287 | 40 | 23 |
| state | 284 | 24 | 27 |
| swift_bic | 209 | 22 | 28 |
| cvv | 96 | 11 | 3 |
| pin | 27 | 4 | 2 |
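The counts above are heavily skewed, which matters when reading per-type evaluation scores. As a rough sketch, the snippet below uses a handful of rows copied from the table to flag entity types with few test examples and to compute the train-set imbalance ratio. The cutoff `MIN_TEST_SUPPORT` is a hypothetical value chosen for illustration, not something the dataset card specifies.

```python
# Counts copied from a subset of the table above, as (train, validation, test).
counts = {
    "medical_record_number": (26031, 2589, 2658),
    "ssn": (16877, 1734, 1661),
    "swift_bic": (209, 22, 28),
    "cvv": (96, 11, 3),
    "pin": (27, 4, 2),
}

MIN_TEST_SUPPORT = 30  # hypothetical cutoff, not from the dataset card

def low_support(counts, min_test):
    """Return entity types whose test count is below min_test."""
    return sorted(name for name, (_, _, test) in counts.items() if test < min_test)

def imbalance_ratio(counts):
    """Ratio of the largest to the smallest training count."""
    train = [c[0] for c in counts.values()]
    return max(train) / min(train)

rare = low_support(counts, MIN_TEST_SUPPORT)
ratio = imbalance_ratio(counts)
print(rare, round(ratio))
```

Metrics for the flagged types rest on very few examples, so they may be better reported with their support counts alongside.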
## Fine-Tuned Models
We have fine-tuned multiple models using this dataset, which are available on Hugging Face:
- [`gretelai/gretel-gliner-bi-small-v1.0`](https://huggingface.co/gretelai/gretel-gliner-bi-small-v1.0)
- [`gretelai/gretel-gliner-bi-base-v1.0`](https://huggingface.co/gretelai/gretel-gliner-bi-base-v1.0)
- [`gretelai/gretel-gliner-bi-large-v1.0`](https://huggingface.co/gretelai/gretel-gliner-bi-large-v1.0)
These models are specifically designed for high-quality PII/PHI detection focusing on the entity types listed in this dataset.
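One way to try these models is through the open-source `gliner` Python package. The sketch below is illustrative only: the label subset and the `threshold=0.3` value are assumptions, `detect_pii` and `group_by_label` are hypothetical helper names, and the sample predictions are hand-written in the shape `predict_entities` is expected to return.

```python
from collections import defaultdict

MODEL_ID = "gretelai/gretel-gliner-bi-small-v1.0"
# A few entity types from the distribution table; choose the ones you need.
LABELS = ["first_name", "last_name", "email", "phone_number", "ssn", "date_of_birth"]

def detect_pii(text, labels=LABELS, threshold=0.3):
    """Run the fine-tuned model; needs `pip install gliner` and network access."""
    from gliner import GLiNER  # imported lazily so the helper below works without it
    model = GLiNER.from_pretrained(MODEL_ID)
    return model.predict_entities(text, labels, threshold=threshold)

def group_by_label(predictions):
    """Collect the predicted span texts under each label."""
    grouped = defaultdict(list)
    for pred in predictions:
        grouped[pred["label"]].append(pred["text"])
    return dict(grouped)

# Hand-written predictions standing in for real model output.
sample = [
    {"start": 0, "end": 4, "text": "Jane", "label": "first_name", "score": 0.97},
    {"start": 5, "end": 8, "text": "Doe", "label": "last_name", "score": 0.95},
]
grouped = group_by_label(sample)
print(grouped)
```

With the package installed, `group_by_label(detect_pii("Contact Jane Doe at jane@example.com"))` would run the same grouping on real predictions.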
## Citation and Usage
If you use this dataset in your research or applications, please cite it as:
```bibtex
@dataset{gretel-pii-docs-en-v1,
author = {Gretel AI},
title = {GLiNER Models for PII Detection through Fine-Tuning on Gretel-Generated Synthetic Documents},
year = {2024},
month = {10},
publisher = {Gretel},
}
```
For questions, issues, or additional information, please visit our [Synthetic Data Discord](https://gretel.ai/discord) community or reach out to [gretel.ai](https://gretel.ai/).
Provider: maas
Created: 2025-05-20



