tursunait/roberta-pii-synth

Name: tursunait/roberta-pii-synth
Creator: tursunait
Published: 2025-12-08 23:57:52
License: MIT

Source: Hugging Face. Updated 2025-12-08; indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/tursunait/roberta-pii-synth

Description:

---
license: mit
datasets:
- tursunait/RoBERTa-pii-synth
language:
- en
tags:
- pii
- ner
- synthetic-data
- token-classification
- deidentification
- privacy
- nlp
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: RoBERTa PII Synthetic Dataset
size_categories:
- 100K<n<1M
---

# **Synthetic PII Detection Dataset (RoBERTa-PII-Synth)**

*A large-scale, fully synthetic dataset for training token-classification models to detect Personally Identifiable Information (PII) in realistic text.*

This dataset was built using an **enhanced synthetic generation pipeline**, designed to better capture the linguistic and formatting variability of real-world user text.

All samples are **fully artificial**: no real people or identifiers appear anywhere.

---

# **📘 Dataset Summary**

**RoBERTa-PII-Synth** contains **120k+ synthetic examples**, each with:

- Natural-language text (short, medium, or long multi-sentence samples)
- Character-level PII span annotations
- Tokenized features for RoBERTa (`tokens`, `input_ids`, `attention_mask`, `labels`)
- A diverse set of entity types:
  - `PERSON`, `EMAIL`, `PHONE`, `ORG`, `ADDRESS`,
  - `DATE`, `CREDIT_CARD`, `SSN`, **`AGE` (new)**

The dataset includes:

✔ **Obfuscated PII** (e.g., `john[at]gmail[dot]com`, spaced-out phone numbers, misspellings)
✔ **Heavy format diversity** (usernames, international phone formats, dotted/space-separated SSNs)
✔ **Noise injection** (length-preserving noise outside entities; realistic corruption inside entities)
✔ **Hard negatives** (GUIDs, MAC addresses, SHA1 hashes, invalid credit card numbers)
✔ **Clean all-O examples** (realistic non-PII text for improving precision)

---

# **📁 Dataset Structure**

### **Splits**

| Split | Samples |
|-------|---------|
| **Train** | ~96,000 |
| **Validation** | ~12,000 |
| **Test** | ~12,000 |

---

### **Features**

| Feature | Type | Description |
|---------|------|-------------|
| `text` | `string` | Raw synthetic text |
| `spans` | `list[{start,end,label}]` | Character-level entity annotations |
| `tokens` | `list[string]` | Word-level tokens (RoBERTa tokenizer) |
| `input_ids` | `list[int]` | RoBERTa token IDs |
| `attention_mask` | `list[int]` | Mask for valid tokens |
| `labels` | `list[int]` | Token classification labels (BILOU-coded) |

---

# **📥 How to Load the Dataset**

```python
from datasets import load_dataset

# Download (or load from the local cache) all splits from the Hugging Face Hub.
ds = load_dataset("tursunait/RoBERTa-pii-synth")

train = ds["train"]
val = ds["validation"]
test = ds["test"]
```

Inspect a sample:

```python
# Print the first training example.
sample = train[0]
print(sample)
```

Example sample:

```json
{
  "text": "Contact kees.guirard@aol.com or +31 880 385 2406. Applicant: John D. Smith, DOB 1990-05-15.",
  "spans": [
    {"start": 8, "end": 28, "label": "EMAIL"},
    {"start": 32, "end": 48, "label": "PHONE"},
    {"start": 61, "end": 74, "label": "PERSON"},
    {"start": 80, "end": 90, "label": "DATE"}
  ]
}
```

## Intended Use

The dataset is optimized for:

- Training PII NER models (RoBERTa, DeBERTa, Electra, etc.)
- Building LLM privacy and redaction filters
- Chrome extensions that mask PII before sending text to chatbots
- Data-loss prevention systems
- Benchmarking robustness to obfuscation and noise

## Limitations

- Fully synthetic, so rare real-world formats may still be missing
- No coreference (e.g., linking “he” to a PERSON)
- In-span noise can alter offsets; downstream systems should handle mapping carefully

## Ethical Considerations

- Contains no real PII
- Designed to improve privacy, compliance, and safety
- MIT license allows academic and commercial use

## Citation

    @dataset{tursunait2025_piisynth,
      author = {Turumbekova, Tursunai},
      title  = {RoBERTa PII Synthetic Dataset},
      year   = {2025},
      url    = {https://huggingface.co/datasets/tursunait/RoBERTa-pii-synth}
    }

## Contact

Tursunai Turumbekova
GitHub: https://github.com/tursunait

---
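
## Example: Redacting Text with `spans`

The character-level `spans` feature described above lends itself to the redaction use case listed under Intended Use. The following is a minimal sketch, not part of the dataset or its tooling: the helpers `redact` and `make_span` and the example sentence are hypothetical, and the code assumes that `end` is an exclusive offset (Python slice semantics), which the dataset card does not state explicitly.

```python
from typing import Dict, List


def redact(text: str, spans: List[Dict]) -> str:
    """Replace each character span with a [LABEL] placeholder.

    Assumes `start` is inclusive and `end` is exclusive (an assumption,
    not something the dataset card confirms).
    """
    pieces = []
    cursor = 0
    # Process spans left to right so untouched text is copied in order.
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["start"] < cursor or span["end"] > len(text):
            raise ValueError(f"overlapping or out-of-range span: {span}")
        pieces.append(text[cursor:span["start"]])  # text before the entity
        pieces.append(f"[{span['label']}]")        # placeholder for the entity
        cursor = span["end"]
    pieces.append(text[cursor:])                   # trailing text after the last entity
    return "".join(pieces)


def make_span(text: str, value: str, label: str) -> Dict:
    """Hypothetical helper: build a span by locating `value` in `text`."""
    start = text.index(value)
    return {"start": start, "end": start + len(value), "label": label}


# Made-up example text; it is not taken from the dataset.
example_text = "Email jane.doe@example.com or call 555-0100 today."
example_spans = [
    make_span(example_text, "jane.doe@example.com", "EMAIL"),
    make_span(example_text, "555-0100", "PHONE"),
]

redacted = redact(example_text, example_spans)
print(redacted)  # Email [EMAIL] or call [PHONE] today.
```

With a loaded split, the same function could be applied as `redact(sample["text"], sample["spans"])`, provided the offsets follow the convention assumed here. The range check matters because, as the Limitations section notes, in-span noise can alter offsets.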

Provided by:

tursunait
