ai4privacy/openpii-masking-mini-10k
Source: Hugging Face. Updated 2026-04-04; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/ai4privacy/openpii-masking-mini-10k
Resource description:
---
language:
- en
- fr
- de
- es
- it
- nl
- bg
- cs
- da
- el
- et
- fi
- hr
- hu
- lt
- lv
- pl
- pt
- ro
- sk
- sl
- sr
- sv
license: cc-by-4.0
size_categories:
- 1K<n<10K
source_datasets:
- ai4privacy/pii-masking-openpii-1m
task_categories:
- token-classification
- text-generation
pretty_name: OpenPII Masking Mini 10K — Multilingual PII Masking Dataset (19 Labels, 23 Languages)
tags:
- privacy
- pii
- sensitive-data
- data-masking
- data-anonymization
- ner
- synthetic
- multilingual
- ai4privacy
- openpii
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---
# OpenPII Masking Mini 10K
A compact, stratified subset of [ai4privacy/pii-masking-openpii-1m](https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m), containing **10,000 samples** for rapid experimentation, fine-tuning, and benchmarking of PII detection and masking models.
## Dataset Description
- **Source**: `ai4privacy/pii-masking-openpii-1m` (1,428,143 samples)
- **Subset size**: 10,000 samples (9,000 train / 1,000 validation)
- **Languages**: 23
- **PII entity types**: 19
- **Random seed**: 42
## Sampling Methodology
Samples were selected using **proportional stratified sampling by language**:
1. Target count per language = `round(lang_proportion × 10,000)`, so each language keeps its share of the source dataset.
2. The source was streamed, and reservoir sampling collected a candidate pool of 3× the target count for each language.
3. Fixed random seed `42` applied at every step for full reproducibility.
4. Final samples shuffled and split 90/10 into train/validation.
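The steps above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the function names and the toy language counts are hypothetical, and because the card does not describe how the 3× candidate pool is narrowed to the final target, the sketch stops at the pool before shuffling and splitting.

```python
import random

def proportional_targets(lang_counts, total):
    """Step 1: target per language = round(proportion * total)."""
    n = sum(lang_counts.values())
    return {lang: round(count / n * total) for lang, count in lang_counts.items()}

def reservoir_sample(stream, k, rng):
    """Step 2: Algorithm R, a uniform sample of k items from a stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

def shuffle_split(items, rng, train_frac=0.9):
    """Step 4: shuffle, then split into train/validation."""
    items = list(items)
    rng.shuffle(items)
    cut = round(len(items) * train_frac)
    return items[:cut], items[cut:]

rng = random.Random(42)  # Step 3: one fixed seed drives every random choice

# Toy counts for illustration only (not the real source distribution).
toy_counts = {"en": 500, "fr": 300, "de": 200}
targets = proportional_targets(toy_counts, total=100)

# Collect 3x the English target from a stand-in stream of 500 record ids.
pool = reservoir_sample(range(500), k=3 * targets["en"], rng=rng)
train, val = shuffle_split(pool, rng)
```

Note that independently rounded per-language targets are not guaranteed to sum exactly to the requested total in general, so a real pipeline may need a small correction step.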
## Language Distribution
| Language | Code | Samples |
|----------|------|---------|
| English | en | 1256 |
| French | fr | 918 |
| German | de | 841 |
| Spanish | es | 632 |
| Italian | it | 625 |
| Bulgarian | bg | 328 |
| Polish | pl | 326 |
| Czech | cs | 325 |
| Estonian | et | 321 |
| Lithuanian | lt | 321 |
| Swedish | sv | 321 |
| Latvian | lv | 320 |
| Slovak | sk | 320 |
| Hungarian | hu | 318 |
| Finnish | fi | 317 |
| Romanian | ro | 317 |
| Danish | da | 316 |
| Greek | el | 316 |
| Croatian | hr | 315 |
| Slovenian | sl | 315 |
| Serbian | sr | 314 |
| Dutch | nl | 311 |
| Portuguese | pt | 307 |
## Schema
| Field | Type | Description |
|-------|------|-------------|
| `source_text` | string | Original, unmasked text containing the (synthetic) PII values |
| `masked_text` | string | Text with PII replaced by `[LABEL_N]` tokens |
| `privacy_mask` | list[dict] | Annotations: label, start, end, value, label_index |
| `uid` | int | Unique identifier |
| `language` | string | ISO 639-1 language code |
| `region` | string | ISO 3166-1 region code |
| `script` | string | Script type (e.g., Latn, Cyrl) |
| `mbert_tokens` | list[str] | mBERT tokenization |
| `mbert_token_classes` | list[str] | BIO NER labels per token |
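To make the relationship between `source_text`, `privacy_mask`, and `masked_text` concrete, here is a small sketch that rebuilds a masked string from the annotations. It rests on assumptions worth checking against the actual data: that `start`/`end` are character offsets into `source_text` with an exclusive end, and that the placeholder is formed as `[LABEL_N]` with `N` taken from `label_index`. The record below is made up for illustration and the helper name is hypothetical.

```python
def apply_privacy_mask(source_text, privacy_mask):
    """Replace each annotated span with a [LABEL_N] placeholder."""
    out = source_text
    # Go right-to-left so earlier offsets stay valid after each replacement.
    for ent in sorted(privacy_mask, key=lambda e: e["start"], reverse=True):
        token = f"[{ent['label']}_{ent['label_index']}]"
        out = out[:ent["start"]] + token + out[ent["end"]:]
    return out

# Made-up record shaped like the schema above (not taken from the dataset).
sample = {
    "source_text": "Contact Anna Meier at anna@example.org.",
    "privacy_mask": [
        {"label": "GIVENNAME", "start": 8, "end": 12, "value": "Anna", "label_index": 1},
        {"label": "SURNAME", "start": 13, "end": 18, "value": "Meier", "label_index": 1},
        {"label": "EMAIL", "start": 22, "end": 38, "value": "anna@example.org", "label_index": 1},
    ],
}

masked = apply_privacy_mask(sample["source_text"], sample["privacy_mask"])
print(masked)  # Contact [GIVENNAME_1] [SURNAME_1] at [EMAIL_1].
```

Comparing the result with the record's own `masked_text` on a few real samples is a quick way to confirm or refute the offset and placeholder assumptions.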
## PII Entity Types (19)
`AGE`, `BUILDINGNUM`, `CITY`, `CREDITCARDNUMBER`, `DATE`, `DRIVERLICENSENUM`, `EMAIL`, `GENDER`, `GIVENNAME`, `IDCARDNUM`, `PASSPORTNUM`, `SEX`, `SOCIALNUM`, `STREET`, `SURNAME`, `TAXNUM`, `TELEPHONENUM`, `TITLE`, `ZIPCODE`
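For token classification, these 19 types are typically expanded into a BIO tag set. The sketch below builds one along with the `label2id`/`id2label` mappings that token-classification models usually expect. It assumes the tags in `mbert_token_classes` follow the common `B-<TYPE>` / `I-<TYPE>` / `O` convention; the exact tag strings should be verified against the data before training.

```python
PII_LABELS = [
    "AGE", "BUILDINGNUM", "CITY", "CREDITCARDNUMBER", "DATE",
    "DRIVERLICENSENUM", "EMAIL", "GENDER", "GIVENNAME", "IDCARDNUM",
    "PASSPORTNUM", "SEX", "SOCIALNUM", "STREET", "SURNAME", "TAXNUM",
    "TELEPHONENUM", "TITLE", "ZIPCODE",
]

def build_bio_labels(entity_types):
    """Return ['O', 'B-X', 'I-X', ...] for the given entity types."""
    labels = ["O"]
    for ent in entity_types:
        labels += [f"B-{ent}", f"I-{ent}"]
    return labels

bio_labels = build_bio_labels(PII_LABELS)  # 1 + 2 * 19 = 39 tags
label2id = {label: i for i, label in enumerate(bio_labels)}
id2label = {i: label for label, i in label2id.items()}
```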
## Usage
```python
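from datasets import load_dataset

# Download (or load from cache) both splits: "train" and "validation".
ds = load_dataset("ai4privacy/openpii-masking-mini-10k")

# Inspect the first training sample.
for sample in ds["train"]:
    print(sample["source_text"])   # text before masking
    print(sample["privacy_mask"])  # list of entity annotations
    break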
```
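A common next step is checking the per-language balance or pulling out a single language. The helpers below are a hypothetical sketch that works on any iterable of row dicts with a `language` field, so they can be tried on toy rows without downloading anything.

```python
from collections import Counter

def language_counts(rows):
    """Count rows per ISO 639-1 language code."""
    return Counter(row["language"] for row in rows)

def filter_language(rows, code):
    """Keep only the rows whose `language` field equals `code`."""
    return [row for row in rows if row["language"] == code]

# Toy rows for illustration; real rows carry the full schema.
toy_rows = [
    {"language": "en", "source_text": "a"},
    {"language": "fr", "source_text": "b"},
    {"language": "en", "source_text": "c"},
]

counts = language_counts(toy_rows)
english = filter_language(toy_rows, "en")
```

With the loaded dataset, the same functions can be called as `language_counts(ds["train"])` and `filter_language(ds["train"], "de")`.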
## p5y Data Analytics
This dataset is built on the [p5y](https://p5y.org) framework - think of it as i18n but for privacy. Just as i18n (internationalization) translates content into different locales, p5y translates sensitive data into privacy-safe formats through a standardized 3-step approach:
1. **Awareness** - Scan and markup private entities in unstructured text, producing a structured privacy mask with entity types, distribution, density, and risk assessment.
2. **Protection** - Control identified personal data through masking, pseudonymization, or k-anonymization, tailored to the specific use case and regulatory requirements.
3. **Quality Assurance** - Measure remaining privacy risk after anonymization, evaluating de-anonymization risks through expert annotation and automated assessment.
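As a rough illustration of the kind of summary the Awareness step describes, the sketch below derives an entity-type distribution and a simple density figure from one record's `privacy_mask`. This is not the p5y implementation. Defining density as the fraction of characters covered by PII spans is an assumption made here, as are the function name, the non-overlapping spans, and the character-offset reading of `start`/`end`.

```python
from collections import Counter

def privacy_summary(source_text, privacy_mask):
    """Summarise entity types and character-level PII density for one record."""
    counts = Counter(ent["label"] for ent in privacy_mask)
    # Assumes spans do not overlap, so their lengths can simply be added.
    covered = sum(ent["end"] - ent["start"] for ent in privacy_mask)
    density = covered / len(source_text) if source_text else 0.0
    return {"entity_counts": dict(counts), "char_density": density}

# Made-up record: 20 characters, 5 of them inside a PII span.
summary = privacy_summary(
    "Hello Alice, welcome",
    [{"label": "GIVENNAME", "start": 6, "end": 11, "value": "Alice", "label_index": 1}],
)
```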
Learn more at [p5y.org](https://p5y.org)
---
## License
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), the same license as the source dataset.
## Citation
If you use this dataset, please cite the original:
```bibtex
@dataset{ai4privacy_openpii_1m,
author = {AI4Privacy},
title = {OpenPII 1M — Multilingual PII Masking Dataset},
year = {2024},
publisher = {Hugging Face},
doi = {10.57967/hf/8202},
url = {https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m}
}
```
Provided by:
ai4privacy



