ai4privacy/openpii-masking-mini-10k

Name: ai4privacy/openpii-masking-mini-10k
Creator: ai4privacy
Published: 2026-04-04 16:18:03
License: CC BY 4.0

Source: Hugging Face. Updated: 2026-04-04. Indexed: 2026-04-12.

Download link:

https://hf-mirror.com/datasets/ai4privacy/openpii-masking-mini-10k

Description:

---
language:
- en
- fr
- de
- es
- it
- nl
- bg
- cs
- da
- el
- et
- fi
- hr
- hu
- lt
- lv
- pl
- pt
- ro
- sk
- sl
- sr
- sv
license: cc-by-4.0
size_categories:
- 1K<n<10K
source_datasets:
- ai4privacy/pii-masking-openpii-1m
task_categories:
- token-classification
- text-generation
pretty_name: OpenPII Masking Mini 10K — Multilingual PII Masking Dataset (19 Labels, 23 Languages)
tags:
- privacy
- pii
- sensitive-data
- data-masking
- data-anonymization
- ner
- synthetic
- multilingual
- ai4privacy
- openpii
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
---

# OpenPII Masking Mini 10K

A compact, stratified subset of [ai4privacy/pii-masking-openpii-1m](https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m), containing **10,000 samples** for rapid experimentation, fine-tuning, and benchmarking of PII detection and masking models.

## Dataset Description

- **Source**: `ai4privacy/pii-masking-openpii-1m` (1,428,143 samples)
- **Subset size**: 10,000 samples (9,000 train / 1,000 validation)
- **Languages**: 23
- **PII entity types**: 19
- **Random seed**: 42

## Sampling Methodology

Samples were selected using **proportional stratified sampling by language**:

1. The target count per language is `round(lang_proportion × 10,000)`, which gives each language proportional representation.
2. Streaming with reservoir sampling collected 3× the target number of candidates per language.
3. A fixed random seed of `42` was applied at every step for full reproducibility.
4. The final samples were shuffled and split 90/10 into train/validation.
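
The sampling steps above can be sketched in a few lines of standard-library Python. This is an illustration, not the authors' script: the function names, the `oversample` parameter, the toy corpus, and the way the 3× candidate pool is reduced to the final target (a uniform draw without replacement) are all assumptions, since the card does not describe that last step.

```python
import random
from collections import defaultdict


def proportional_targets(lang_counts, total):
    """Step 1: target per language = round(share of the corpus * total)."""
    corpus_size = sum(lang_counts.values())
    return {lang: round(n / corpus_size * total) for lang, n in lang_counts.items()}


def stratified_reservoir(stream, targets, oversample=3, seed=42):
    """Steps 2-3: one pass over (language, record) pairs, keeping a
    per-language reservoir of `oversample * target` candidates."""
    rng = random.Random(seed)  # single seeded generator for every random choice
    reservoirs = defaultdict(list)
    seen = defaultdict(int)
    for lang, record in stream:
        if lang not in targets:
            continue
        capacity = oversample * targets[lang]
        seen[lang] += 1
        if len(reservoirs[lang]) < capacity:
            reservoirs[lang].append(record)
        else:
            # Classic reservoir replacement: keep the new record with
            # probability capacity / seen[lang].
            j = rng.randrange(seen[lang])
            if j < capacity:
                reservoirs[lang][j] = record
    # Assumption: the candidate pool is reduced to the target by a uniform draw.
    picked = []
    for lang in sorted(targets):
        pool = reservoirs[lang]
        picked.extend(rng.sample(pool, min(targets[lang], len(pool))))
    rng.shuffle(picked)  # Step 4 (first half): shuffle before splitting
    return picked


def split_train_validation(samples, train_fraction=0.9):
    """Step 4 (second half): 90/10 split of the shuffled samples."""
    cut = round(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


# Toy corpus (hypothetical): 1,000 records across three languages.
toy_counts = {"en": 600, "fr": 300, "de": 100}
toy_stream = [(lang, f"{lang}-{i}") for lang, n in toy_counts.items() for i in range(n)]

targets = proportional_targets(toy_counts, total=100)
subset = stratified_reservoir(iter(toy_stream), targets)
train, validation = split_train_validation(subset)
print(targets, len(train), len(validation))
```

Rounded per-language targets are not guaranteed to sum exactly to the requested total in general, so a real pipeline may need a small correction step.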

## Language Distribution

| Language | Code | Samples |
|----------|------|---------|
| EN | en | 1256 |
| FR | fr | 918 |
| DE | de | 841 |
| ES | es | 632 |
| IT | it | 625 |
| BG | bg | 328 |
| PL | pl | 326 |
| CS | cs | 325 |
| ET | et | 321 |
| LT | lt | 321 |
| SV | sv | 321 |
| LV | lv | 320 |
| SK | sk | 320 |
| HU | hu | 318 |
| FI | fi | 317 |
| RO | ro | 317 |
| DA | da | 316 |
| EL | el | 316 |
| HR | hr | 315 |
| SL | sl | 315 |
| SR | sr | 314 |
| NL | nl | 311 |
| PT | pt | 307 |

## Schema

| Field | Type | Description |
|-------|------|-------------|
| `source_text` | string | Original (unmasked) text containing the PII values |
| `masked_text` | string | Text with PII replaced by `[LABEL_N]` tokens |
| `privacy_mask` | list[dict] | Annotations: label, start, end, value, label_index |
| `uid` | int | Unique identifier |
| `language` | string | ISO 639-1 language code |
| `region` | string | ISO 3166-1 region code |
| `script` | string | Script type (e.g., Latn, Cyrl) |
| `mbert_tokens` | list[str] | mBERT tokenization |
| `mbert_token_classes` | list[str] | BIO NER labels per token |

## PII Entity Types (19)

`AGE`, `BUILDINGNUM`, `CITY`, `CREDITCARDNUMBER`, `DATE`, `DRIVERLICENSENUM`, `EMAIL`, `GENDER`, `GIVENNAME`, `IDCARDNUM`, `PASSPORTNUM`, `SEX`, `SOCIALNUM`, `STREET`, `SURNAME`, `TAXNUM`, `TELEPHONENUM`, `TITLE`, `ZIPCODE`

## Usage

```python
from datasets import load_dataset

# Download (or load from cache) both splits: "train" and "validation".
ds = load_dataset("ai4privacy/openpii-masking-mini-10k")

# Print the text and annotations of the first training sample, then stop.
for sample in ds["train"]:
    print(sample["source_text"])
    print(sample["privacy_mask"])
    break
```

## p5y Data Analytics

This dataset is built on the [p5y](https://p5y.org) framework: think of it as i18n, but for privacy. Just as i18n (internationalization) translates content into different locales, p5y translates sensitive data into privacy-safe formats through a standardized 3-step approach:

1. **Awareness** - Scan and mark up private entities in unstructured text, producing a structured privacy mask with entity types, distribution, density, and risk assessment.
2. **Protection** - Control identified personal data through masking, pseudonymization, or k-anonymization, tailored to the specific use case and regulatory requirements.
3. **Quality Assurance** - Measure the remaining privacy risk after anonymization, evaluating de-anonymization risks through expert annotation and automated assessment.

Learn more at [p5y.org](https://p5y.org).

---

## License

[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), the same license as the source dataset.

## Citation

If you use this dataset, please cite the original:

```bibtex
@dataset{ai4privacy_openpii_1m,
  author    = {AI4Privacy},
  title     = {OpenPII 1M — Multilingual PII Masking Dataset},
  year      = {2024},
  publisher = {Hugging Face},
  doi       = {10.57967/hf/8202},
  url       = {https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m}
}
```
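
As a supplement to the Schema section, the relationship between `source_text`, `privacy_mask`, and `masked_text` can be illustrated with a small sketch. It rests on two assumptions that the card does not spell out: that `start` and `end` are character offsets into `source_text` with an exclusive `end`, and that each placeholder is built as `[{label}_{label_index}]`. The helper name `apply_privacy_mask` and the sample record are hypothetical and hand-written, not taken from the dataset.

```python
def apply_privacy_mask(source_text, privacy_mask):
    """Rebuild a masked string by swapping each annotated span for a placeholder.

    Assumes character offsets with an exclusive end and non-overlapping spans.
    """
    pieces = []
    cursor = 0
    for span in sorted(privacy_mask, key=lambda s: s["start"]):
        pieces.append(source_text[cursor:span["start"]])           # text before the span
        pieces.append(f"[{span['label']}_{span['label_index']}]")  # assumed token format
        cursor = span["end"]
    pieces.append(source_text[cursor:])                             # trailing text
    return "".join(pieces)


def spans_match_values(source_text, privacy_mask):
    """Sanity check: every span's slice of the text equals its recorded value."""
    return all(source_text[s["start"]:s["end"]] == s["value"] for s in privacy_mask)


# Hypothetical record shaped like the schema above (not a real dataset row).
sample = {
    "source_text": "Contact Ana Silva at ana@example.com.",
    "privacy_mask": [
        {"label": "GIVENNAME", "start": 8, "end": 11, "value": "Ana", "label_index": 1},
        {"label": "SURNAME", "start": 12, "end": 17, "value": "Silva", "label_index": 1},
        {"label": "EMAIL", "start": 21, "end": 36, "value": "ana@example.com", "label_index": 1},
    ],
}

masked = apply_privacy_mask(sample["source_text"], sample["privacy_mask"])
print(masked)
```

Running `spans_match_values` on a few real rows is a quick way to confirm whether the offset assumption holds before relying on it.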

Provider:

ai4privacy
