Ari-S-123/pii-detection-english-consolidated
Source: Hugging Face. Updated 2025-12-07; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Ari-S-123/pii-detection-english-consolidated
Resource description:
---
dataset_info:
  features:
  - name: source_text
    dtype: string
  - name: privacy_mask
    list:
    - name: end
      dtype: int64
    - name: label
      dtype: string
    - name: label_index
      dtype: int64
    - name: start
      dtype: int64
    - name: value
      dtype: string
  - name: feature_dimension
    dtype: string
  - name: seed_pii_type
    dtype: string
  - name: seed_pii_value
    dtype: string
  - name: seed_pii_locale
    dtype: string
  - name: scenario
    dtype: string
  - name: type_variant
    dtype: string
  - name: generation_id
    dtype: string
  - name: data_source
    dtype: string
  - name: language
    dtype: string
  - name: region
    dtype: string
  - name: script
    dtype: string
  splits:
  - name: train
    num_bytes: 40500982
    num_examples: 125327
  - name: test
    num_bytes: 10102881
    num_examples: 31361
  download_size: 19401115
  dataset_size: 50603863
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: mit
task_categories:
- token-classification
language:
- en
tags:
- pii
- ner
- privacy
- synthetic-data
size_categories:
- 100K<n<1M
---
# PII Detection Combined Dataset
A combined dataset for PII (Personally Identifiable Information) detection.
It merges the English-only subset of ai4privacy with challenging synthetic examples that target NER failure modes.
The synthetic examples were generated and semantically validated with different LLMs, and class labels were consolidated across both sources to prevent label fragmentation.
## Dataset Description
This dataset combines two sources:
1. **[ai4privacy/open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)** (English subset): 120,533 train / 30,160 test examples
2. **Synthetic data** (generated with Grok-4.1-Non-Reasoning, validated with GPT-5.1): 4,801 train / 1,201 test examples
**Total**: 125,334 train / 31,361 test examples. The `dataset_info` metadata lists 125,327 train examples, seven fewer than this sum.
### Synthetic Data Feature Dimensions
The synthetic data specifically targets six NER failure mode dimensions
from [Singh & Narayanan (2025) "Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability"](https://arxiv.org/abs/2504.12308):
| Dimension | Train Count | Description |
|-----------|-------------|-------------|
| adversarial | 464 | Intentionally deceptive patterns |
| basic | 989 | Standard, well-formatted entities |
| contextual | 801 | Ambiguous entities requiring context |
| evolving | 754 | Modern/emerging PII formats |
| multilingual | 917 | International formats in English |
| noisy | 876 | Real-world text imperfections |
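The per-dimension counts can be checked against the loaded data with a simple tally. The sketch below is a minimal illustration that uses toy records rather than real rows; it assumes that rows from ai4privacy carry an empty or missing `feature_dimension`, which the card implies but does not state outright. The helper name `count_feature_dimensions` is hypothetical.

```python
from collections import Counter

def count_feature_dimensions(records):
    """Count synthetic records per feature_dimension value."""
    return Counter(
        r["feature_dimension"]
        for r in records
        # Assumption: non-synthetic rows have no feature_dimension value.
        if r.get("data_source") == "synthetic" and r.get("feature_dimension")
    )

# Toy records standing in for dataset rows (illustrative values only).
toy_records = [
    {"data_source": "synthetic", "feature_dimension": "noisy"},
    {"data_source": "synthetic", "feature_dimension": "basic"},
    {"data_source": "synthetic", "feature_dimension": "noisy"},
    {"data_source": "ai4privacy", "feature_dimension": None},
]
counts = count_feature_dimensions(toy_records)
print(counts.most_common())

# Train counts copied from the table above; they sum to the synthetic train size.
table_train_counts = {
    "adversarial": 464, "basic": 989, "contextual": 801,
    "evolving": 754, "multilingual": 917, "noisy": 876,
}
print(sum(table_train_counts.values()))  # 4801
```

The same function should work on `dataset["train"]` from the `datasets` library, since iterating a split yields dict-like rows.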
## Dataset Schema
| Field | Type | Description |
|-------|------|-------------|
| `source_text` | string | Original text containing PII entities |
| `privacy_mask` | list | Entity annotations, each with `label`, `label_index`, `start`, `end`, and `value` |
| `data_source` | string | Either "ai4privacy" or "synthetic" |
| `feature_dimension` | string | NER challenge dimension (synthetic only) |
| `language` | string | Language code (always "en") |
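To make the `privacy_mask` structure concrete, here is a minimal sketch that validates the annotations and produces a masked string. It assumes `start` and `end` are character offsets into `source_text` with an exclusive `end`; the card does not spell this out, so verify it on real rows first. The example row, the label names `NAME` and `EMAIL`, and both helper functions are hypothetical.

```python
def spans_match(example):
    """Check that every annotated span reproduces its stored value."""
    text = example["source_text"]
    return all(
        text[ent["start"]:ent["end"]] == ent["value"]
        for ent in example["privacy_mask"]
    )

def mask_text(source_text, privacy_mask):
    """Replace each annotated span with a [LABEL] placeholder."""
    masked = source_text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(privacy_mask, key=lambda e: e["start"], reverse=True):
        masked = masked[:ent["start"]] + f"[{ent['label']}]" + masked[ent["end"]:]
    return masked

# Hypothetical row in the documented schema (labels are made up).
example = {
    "source_text": "Contact Jane Doe at jane@example.com.",
    "privacy_mask": [
        {"start": 8, "end": 16, "label": "NAME",
         "label_index": 1, "value": "Jane Doe"},
        {"start": 20, "end": 36, "label": "EMAIL",
         "label_index": 1, "value": "jane@example.com"},
    ],
}

print(spans_match(example))  # True
print(mask_text(example["source_text"], example["privacy_mask"]))
# Contact [NAME] at [EMAIL].
```

Running `spans_match` over a sample of real rows is a cheap way to confirm the offset convention before building anything on top of it.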
## Usage
```python
from datasets import load_dataset

# Load from the Hugging Face Hub
dataset = load_dataset("Ari-S-123/pii-detection-english-consolidated")

# Or load from local Parquet files
dataset = load_dataset("parquet", data_files={
    "train": "train.parquet",
    "test": "test.parquet",
})

# Access examples
print(dataset["train"][0])
```
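Because the dataset is tagged for token classification, the character-level spans usually need to be turned into per-token tags before training. The following is a rough sketch of one way to do that with whitespace tokens and BIO tags. It assumes character offsets with an exclusive `end`, and the function `spans_to_bio`, the sample row, and its labels are hypothetical; a real pipeline would more likely align spans with a subword tokenizer's offset mapping.

```python
import re

def spans_to_bio(source_text, privacy_mask):
    """Tag whitespace-delimited tokens with BIO labels from character spans."""
    tokens, tags = [], []
    for m in re.finditer(r"\S+", source_text):
        tag = "O"
        for ent in privacy_mask:
            # A token belongs to an entity if their character ranges overlap.
            if m.start() < ent["end"] and m.end() > ent["start"]:
                # The first overlapping token starts at or before the entity start.
                prefix = "B" if m.start() <= ent["start"] else "I"
                tag = f"{prefix}-{ent['label']}"
                break
        tokens.append(m.group())
        tags.append(tag)
    return tokens, tags

# Hypothetical row; label names are illustrative.
row = {
    "source_text": "Contact Jane Doe at jane@example.com.",
    "privacy_mask": [
        {"start": 8, "end": 16, "label": "NAME", "value": "Jane Doe"},
        {"start": 20, "end": 36, "label": "EMAIL", "value": "jane@example.com"},
    ],
}

tokens, tags = spans_to_bio(row["source_text"], row["privacy_mask"])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Note that the trailing period stays attached to the email token here, which is a limitation of whitespace tokenization rather than of the annotations.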
## Citation
If you use this dataset, please cite:
```bibtex
@misc{pii_combined_dataset_2025,
title={PII Detection Combined Dataset},
year={2025},
publisher={Hugging Face},
note={Combines ai4privacy English subset with synthetic challenging examples}
}
```
## License
MIT License
## Dataset Creation
- **Created**: 2025-12-05
- **ai4privacy source**: [ai4privacy/open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
- **Synthetic generation**: xAI Grok-4.1-Non-Reasoning
- **Synthetic validation**: OpenAI GPT-5.1 Low Reasoning Effort
- **Split strategy**: 80/20 stratified split on synthetic data
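
The card does not say which field the stratified split was keyed on or which tool performed it. As an illustration only, the sketch below shows an 80/20 stratified split in plain Python, assuming `feature_dimension` as the stratification key; the function `stratified_split`, the seed, and the toy records are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(records, key, test_fraction=0.2, seed=0):
    """Split records so each value of `key` keeps roughly the same test share."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)

    rng = random.Random(seed)
    train, test = [], []
    for name in sorted(groups):  # sorted for reproducible ordering
        group = groups[name]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy records: 10 per dimension (illustrative, not real data).
toy = [
    {"generation_id": f"{dim}-{i}", "feature_dimension": dim}
    for dim in ("basic", "noisy")
    for i in range(10)
]

train, test = stratified_split(toy, key="feature_dimension")
print(len(train), len(test))  # 16 4
print(Counter(r["feature_dimension"] for r in test))
```

Splitting within each group keeps the dimension proportions in the test set close to those in the train set, which is the point of stratifying.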
Provider:
Ari-S-123



