tursunait/roberta-pii-synth
Source: Hugging Face. Updated 2025-12-08; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/tursunait/roberta-pii-synth
Resource description:
---
license: mit
datasets:
- tursunait/RoBERTa-pii-synth
language:
- en
tags:
- pii
- ner
- synthetic-data
- token-classification
- deidentification
- privacy
- nlp
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: RoBERTa PII Synthetic Dataset
size_categories:
- 100K<n<1M
---
# **Synthetic PII Detection Dataset (RoBERTa-PII-Synth)**
*A large-scale, fully synthetic dataset for training token-classification models to detect Personally Identifiable Information (PII) in realistic text.*
This dataset was built using an **enhanced synthetic generation pipeline** designed to better capture the linguistic and formatting variability of real-world user text. All samples are **fully artificial**: no real people or identifiers appear anywhere.
---
# **📘 Dataset Summary**
**RoBERTa-PII-Synth** contains **120k+ synthetic examples**, each with:
- Natural-language text (short, medium, or long multi-sentence samples)
- Character-level PII span annotations
- Tokenized features for RoBERTa (`tokens`, `input_ids`, `attention_mask`, `labels`)
- A diverse set of entity types: `PERSON`, `EMAIL`, `PHONE`, `ORG`, `ADDRESS`, `DATE`, `CREDIT_CARD`, `SSN`, and **`AGE` (new)**
The dataset includes:
- ✔ **Obfuscated PII** (e.g., `john[at]gmail[dot]com`, spaced-out phone numbers, misspellings)
- ✔ **Heavy format diversity** (usernames, international phone formats, dotted/space-separated SSNs)
- ✔ **Noise injection** (length-preserving noise outside entities; realistic corruption inside entities)
- ✔ **Hard negatives** (GUIDs, MAC addresses, SHA1 hashes, invalid credit card numbers)
- ✔ **Clean all-O examples** (realistic non-PII text for improving precision)
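To illustrate why an invalid credit card number makes a useful hard negative, here is a minimal sketch of the standard Luhn checksum. It assumes that "invalid" means failing this checksum, which the card does not state; `luhn_valid` is a hypothetical helper and is not part of the dataset or its generation pipeline.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep digits only, so spaced or dashed formats are accepted.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# A widely used test number passes; changing its last digit breaks the checksum.
print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

Both strings look like card numbers on the surface, so a model has to learn that only some 16-digit sequences should be tagged `CREDIT_CARD`.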
---
# **📁 Dataset Structure**
### **Splits**
| Split | Samples |
|-------|---------|
| **Train** | ~96,000 |
| **Validation** | ~12,000 |
| **Test** | ~12,000 |
---
### **Features**
| Feature | Type | Description |
|---------|------|-------------|
| `text` | `string` | Raw synthetic text |
| `spans` | `list[{start,end,label}]` | Character-level entity annotations |
| `tokens` | `list[string]` | Word-level tokens (RoBERTa tokenizer) |
| `input_ids` | `list[int]` | RoBERTa token IDs |
| `attention_mask` | `list[int]` | Mask for valid tokens |
| `labels` | `list[int]` | Token classification labels (BILOU-coded) |
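The relationship between character-level `spans` and BILOU-coded `labels` can be sketched as follows. This is an illustration under stated assumptions, not the dataset's actual preprocessing: it uses whitespace-delimited tokens instead of RoBERTa tokens, emits string tags instead of the integer IDs stored in `labels` (the ID mapping is not given in this card), and treats span ends as exclusive. The function name `bilou_tags` is hypothetical.

```python
import re


def bilou_tags(text, spans):
    """Tag whitespace-delimited tokens with BILOU labels from character spans."""
    # Each token keeps its character offsets: (token, start, end).
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for span in spans:
        # Indices of tokens that overlap the span (end offsets are exclusive).
        inside = [
            i for i, (_, start, end) in enumerate(tokens)
            if start < span["end"] and end > span["start"]
        ]
        if not inside:
            continue
        label = span["label"]
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label  # Unit: single-token entity
        else:
            tags[inside[0]] = "B-" + label  # Beginning
            for i in inside[1:-1]:
                tags[i] = "I-" + label      # Inside
            tags[inside[-1]] = "L-" + label  # Last
    return [tok for tok, _, _ in tokens], tags


text = "John D. Smith wrote to a@b.co"
spans = [
    {"start": 0, "end": 13, "label": "PERSON"},
    {"start": 23, "end": 29, "label": "EMAIL"},
]
tokens, tags = bilou_tags(text, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

With a subword tokenizer the same overlap test applies to each subword's offsets; how the dataset labels continuation subwords is not stated here.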
---
# **📥 How to Load the Dataset**
```python
from datasets import load_dataset
ds = load_dataset("tursunait/RoBERTa-pii-synth")
train = ds["train"]
val = ds["validation"]
test = ds["test"]
```
Inspect a sample:
```python
sample = train[0]
print(sample)
```
Example sample (only `text` and `spans` shown):
```json
{
  "text": "Contact kees.guirard@aol.com or +31 880 385 2406. Applicant: John D. Smith, DOB 1990-05-15.",
  "spans": [
    {"start": 8, "end": 28, "label": "EMAIL"},
    {"start": 32, "end": 48, "label": "PHONE"},
    {"start": 61, "end": 74, "label": "PERSON"},
    {"start": 80, "end": 90, "label": "DATE"}
  ]
}
```
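A record with `text` and `spans` fields can be used directly for redaction. The sketch below is a minimal illustration on a made-up record, assuming that `start` is inclusive, `end` is exclusive (Python slice semantics), and spans do not overlap; `redact` is a hypothetical helper, not part of the dataset tooling.

```python
def redact(text, spans):
    """Replace each annotated span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    # Process spans left to right so untouched text is copied in order.
    for span in sorted(spans, key=lambda s: s["start"]):
        pieces.append(text[cursor:span["start"]])
        pieces.append("[" + span["label"] + "]")
        cursor = span["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical record in the same shape as a dataset row.
record = {
    "text": "Mail a@b.co or call John D. Smith.",
    "spans": [
        {"start": 5, "end": 11, "label": "EMAIL"},
        {"start": 20, "end": 33, "label": "PERSON"},
    ],
}

# Slicing with the offsets recovers the surface form of each entity.
for span in record["spans"]:
    print(span["label"], repr(record["text"][span["start"]:span["end"]]))

print(redact(record["text"], record["spans"]))
```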
## Intended Use
The dataset is optimized for:
- Training PII NER models (RoBERTa, DeBERTa, Electra, etc.)
- Building LLM privacy and redaction filters
- Chrome extensions that mask PII before sending text to chatbots
- Data-loss prevention systems
- Benchmarking robustness to obfuscation + noise
## Limitations
- Fully synthetic, so rare real-world formats may still be missing
- No coreference (e.g., linking “he” to a PERSON)
- In-span noise can alter offsets; downstream systems should handle mapping carefully
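Because offsets can shift, downstream code may want to sanity-check spans before relying on them. The following is a minimal sketch of such a check, assuming end-exclusive offsets and spans that should not overlap; the function `span_problems` and the specific checks are illustrative choices, not something the dataset ships.

```python
def span_problems(text, spans):
    """Return human-readable descriptions of suspicious spans."""
    problems = []
    previous_end = 0
    for span in sorted(spans, key=lambda s: (s["start"], s["end"])):
        start, end, label = span["start"], span["end"], span["label"]
        # Offsets must describe a non-empty slice inside the text.
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label} [{start}, {end}) is empty or out of bounds")
            continue
        # Overlap with an earlier span usually signals a mapping error.
        if start < previous_end:
            problems.append(f"{label} [{start}, {end}) overlaps an earlier span")
        # Entities that begin or end with whitespace are likely off by one.
        surface = text[start:end]
        if surface != surface.strip():
            problems.append(f"{label} [{start}, {end}) has surrounding whitespace")
        previous_end = max(previous_end, end)
    return problems


text = "Call John now"
print(span_problems(text, [{"start": 5, "end": 9, "label": "PERSON"}]))
print(span_problems(text, [{"start": 5, "end": 10, "label": "PERSON"}]))
```

The first call reports no problems; the second flags the span because it includes the space after the name.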
## Ethical Considerations
- Contains no real PII
- Designed to improve privacy, compliance, and safety
- MIT license allows academic and commercial use
## Citation
@dataset{tursunait2025_piisynth,
author = {Turumbekova, Tursunai},
title = {RoBERTa PII Synthetic Dataset},
year = {2025},
url = {https://huggingface.co/datasets/tursunait/RoBERTa-pii-synth}
}
## Contact
Tursunai Turumbekova
GitHub: https://github.com/tursunait
---
Provider:
tursunait



