Ari-S-123/better-english-pii-anonymizer
---
license: mit
task_categories:
- token-classification
language:
- en
tags:
- pii
- ner
- privacy
- synthetic-data
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train-00000-of-00001.parquet"
  - split: test
    path: "data/test-00000-of-00001.parquet"
---
# PII Detection Combined Dataset
Combined dataset for PII (Personally Identifiable Information) detection,
merging the ai4privacy English-only subset with synthetically generated
challenging examples targeting NER failure modes.
## Dataset Description
This dataset combines two sources:
1. **[ai4privacy/open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)** (English subset): 120,533 train / 30,160 test examples
2. **Synthetic data** (generated by Grok-4.1-Non-reasoning, validated by GPT-5.1): 4,801 train / 1,201 test examples
**Total**: 125,334 train / 31,361 test examples
### Synthetic Data Feature Dimensions
The synthetic data specifically targets six NER failure mode dimensions
from [Singh & Narayanan (2025) "Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability"](https://arxiv.org/abs/2504.12308):
| Dimension | Train Count | Description |
|-----------|-------------|-------------|
| adversarial | 464 | Intentionally deceptive patterns |
| basic | 989 | Standard, well-formatted entities |
| contextual | 801 | Ambiguous entities requiring context |
| evolving | 754 | Modern/emerging PII formats |
| multilingual | 917 | International formats in English |
| noisy | 876 | Real-world text imperfections |
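As a quick consistency check, the per-dimension train counts in the table can be summed and compared with the synthetic train total stated in the Dataset Description. The dictionary below simply copies the table; the variable names are illustrative and not part of the dataset.

```python
# Train counts per dimension, copied from the table above.
synthetic_train_counts = {
    "adversarial": 464,
    "basic": 989,
    "contextual": 801,
    "evolving": 754,
    "multilingual": 917,
    "noisy": 876,
}

synthetic_train_total = sum(synthetic_train_counts.values())
print(synthetic_train_total)  # 4801, matching the synthetic train count

# Share of the synthetic train split taken by each dimension.
for name, count in synthetic_train_counts.items():
    print(f"{name}: {count / synthetic_train_total:.1%}")
```

The split is not uniform: `basic` is the largest dimension and `adversarial` the smallest, which may matter when reporting per-dimension metrics.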
## Dataset Schema
| Field | Type | Description |
|-------|------|-------------|
| `source_text` | string | Original text containing PII entities |
| `privacy_mask` | list | Entity annotations with label, start, end, value |
| `data_source` | string | Either "ai4privacy" or "synthetic" |
| `feature_dimension` | string | NER challenge dimension (synthetic only) |
| `language` | string | Language code (always "en") |
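To illustrate how the `privacy_mask` annotations relate to `source_text`, here is a minimal masking sketch. It assumes each annotation is a dict with the keys `label`, `start`, `end`, and `value`, and that `start`/`end` are character offsets with an exclusive `end`; the card does not confirm the offset convention. The record and its label names are invented for illustration and are not taken from the dataset.

```python
def mask_text(source_text, privacy_mask):
    """Replace each annotated span with a [LABEL] placeholder."""
    masked = source_text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(privacy_mask, key=lambda e: e["start"], reverse=True):
        start, end = entity["start"], entity["end"]
        masked = masked[:start] + f"[{entity['label']}]" + masked[end:]
    return masked


# Hypothetical record; the text, offsets, and labels are assumptions.
record = {
    "source_text": "Contact Jane Doe at jane@example.com.",
    "privacy_mask": [
        {"label": "GIVENNAME", "start": 8, "end": 12, "value": "Jane"},
        {"label": "EMAIL", "start": 20, "end": 36, "value": "jane@example.com"},
    ],
    "data_source": "synthetic",
    "feature_dimension": "basic",
    "language": "en",
}

# Check the assumed offset convention against the stored values.
for entity in record["privacy_mask"]:
    span = record["source_text"][entity["start"]:entity["end"]]
    assert span == entity["value"]

masked_text = mask_text(record["source_text"], record["privacy_mask"])
print(masked_text)  # Contact [GIVENNAME] Doe at [EMAIL].
```

Running the same offset check on real rows is a cheap way to find out whether the exclusive-end assumption holds before building on it.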
## Usage
```python
from datasets import load_dataset
# Load from the Hugging Face Hub
dataset = load_dataset("Ari-S-123/better-english-pii-anonymizer")
# Or load from local Parquet files
dataset = load_dataset("parquet", data_files={
    "train": "train.parquet",
    "test": "test.parquet"
})
# Access examples
print(dataset["train"][0])
```
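Once rows are available, the `data_source` and `feature_dimension` fields can be used to isolate the synthetic examples. The sketch below counts synthetic rows per dimension using only the standard library. It runs on a small list of hypothetical rows so that it is self-contained; how `feature_dimension` is filled for ai4privacy rows (here `None`) is an assumption.

```python
from collections import Counter


def count_dimensions(rows):
    """Count synthetic rows per feature_dimension."""
    return Counter(
        row["feature_dimension"]
        for row in rows
        if row["data_source"] == "synthetic"
    )


# Hypothetical rows standing in for dataset["train"].
rows = [
    {"data_source": "synthetic", "feature_dimension": "noisy"},
    {"data_source": "synthetic", "feature_dimension": "basic"},
    {"data_source": "synthetic", "feature_dimension": "noisy"},
    {"data_source": "ai4privacy", "feature_dimension": None},
]

dimension_counts = count_dimensions(rows)
print(dimension_counts)
```

Because the function only iterates over dict-like rows, it can also be given a split loaded with `load_dataset`, for example `count_dimensions(dataset["train"])`.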
## Citation
If you use this dataset, please cite:
```bibtex
@misc{pii_combined_dataset_2025,
title={PII Detection Combined Dataset},
year={2025},
publisher={Hugging Face},
note={Combines ai4privacy English subset with synthetic challenging examples}
}
```
## License
MIT License
## Dataset Creation
- **Created**: 2025-12-05
- **ai4privacy source**: [ai4privacy/open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
- **Synthetic generation**: xAI Grok-4.1-Non-reasoning
- **Synthetic validation**: OpenAI GPT-5.1
- **Split strategy**: 80/20 stratified split on synthetic data
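
For readers unfamiliar with the term, the following is a minimal sketch of what an 80/20 stratified split can look like. It is not the procedure used to build this dataset: the card does not say which field was used for stratification, so grouping by `feature_dimension`, the seed, and the function name are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, test_fraction=0.2, seed=0):
    """Split rows into train/test so each group keeps the same ratio."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    train, test = [], []
    for name in sorted(groups):
        group = groups[name]
        rng.shuffle(group)
        # Number of test rows for this group, rounded to the nearest integer.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical rows: two dimensions with ten examples each.
rows = [
    {"id": i, "feature_dimension": dim}
    for dim in ("basic", "noisy")
    for i in range(10)
]

train, test = stratified_split(rows, key="feature_dimension")
print(len(train), len(test))  # 16 4
```

Splitting within each group, rather than across the whole pool, keeps rare dimensions represented in both splits.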
Provider: Ari-S-123



