ai4privacy/openpii-masking-nano-1k
Source: Hugging Face (updated 2026-04-04, indexed 2026-04-12)
Download link:
https://hf-mirror.com/datasets/ai4privacy/openpii-masking-nano-1k
Description:
---
license: cc-by-4.0
language:
- en
- fr
- de
- es
- it
- nl
- bg
- cs
- da
- el
- et
- fi
- hr
- hu
- lt
- lv
- pl
- pt
- ro
- sk
- sl
- sr
- sv
task_categories:
- token-classification
tags:
- privacy
- pii
- sensitive-data
- data-masking
- data-anonymization
- ner
- synthetic
- multilingual
- ai4privacy
- openpii
- benchmark
- evaluation
- gdpr
pretty_name: OpenPII-Masking-Nano-1K - Multilingual PII Detection Benchmark
size_categories:
- 1K<n<10K
source_datasets:
- ai4privacy/openpii-masking-mini-10k
- ai4privacy/pii-masking-openpii-1m
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: source_text
    dtype: string
  - name: masked_text
    dtype: string
  - name: privacy_mask
    list:
    - name: end
      dtype: int64
    - name: label
      dtype: string
    - name: label_index
      dtype: int64
    - name: start
      dtype: int64
    - name: value
      dtype: string
  - name: split
    dtype: string
  - name: uid
    dtype: int64
  - name: language
    dtype: string
  - name: region
    dtype: string
  - name: script
    dtype: string
  - name: mbert_tokens
    list: string
  - name: mbert_token_classes
    list: string
  splits:
  - name: train
    num_bytes: 3347584
    num_examples: 1000
  download_size: 1231736
  dataset_size: 3347584
---
# OpenPII-Masking-Nano-1K - The Fast PII Detection Benchmark
<p align="center">
<img src="assets/europe_language_map.png" alt="Geographic Coverage - 29 Regions, 23 Languages" width="480"/>
</p>
## A compact, provider-agnostic benchmark for quick PII detection evaluation.
This dataset is the little sibling of [OpenPII-Masking-Mini-10K](https://huggingface.co/datasets/ai4privacy/openpii-masking-mini-10k). It uses the same methodology, the same 23 languages, and the same 19 entity types, but contains only 1K samples, which makes it suited to rapid iteration, CI/CD pipelines, and quick provider comparisons.
| Property | Value |
|:---|:---|
| **Total Samples** | 1,000 |
| **Languages** | 23 |
| **Regions** | 29 |
| **Scripts** | 3 (Latin, Cyrillic, Greek) |
| **Entity Types** | 19 |
| **Total Annotations** | 8,100 |
| **Avg Entities / Sample** | 8.1 |
| **PII Coverage** | 100% (every sample contains PII) |
| **Source** | [OpenPII-Masking-Mini-10K](https://huggingface.co/datasets/ai4privacy/openpii-masking-mini-10k) / [OpenPII-Masking-1M](https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m) |
| **Sampling** | Stratified by language, seed=42 |
| **License** | CC-BY-4.0 |
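The table only states that sampling was "stratified by language, seed=42"; the exact procedure is not documented here. As a rough illustration of what such a step might look like, the sketch below draws a proportional sample per language from a toy pool. The function name `stratified_sample`, the proportional quota rule, and the toy records are all assumptions, not the authors' pipeline.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, n, seed=42):
    """Draw about n records, keeping each group's share roughly proportional.

    Hypothetical helper: the dataset card does not describe the real sampler.
    """
    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    total = len(records)
    sample = []
    for name in sorted(groups):  # sorted so the iteration order is deterministic
        members = groups[name]
        quota = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Toy pool (made-up data): 60 English, 30 French, 10 German records.
pool = [{"uid": i, "language": lang}
        for i, lang in enumerate(["en"] * 60 + ["fr"] * 30 + ["de"] * 10)]
subset = stratified_sample(pool, key="language", n=10)
```

Because quotas are rounded per group, the result can differ slightly from `n` on real data; a production sampler would need a rule for distributing the remainder.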
## Geographic & Language Coverage
The benchmark spans **29 regions** - 25 European countries plus Canada (CA), United States (US), Mexico (MX), and India (IN) - covering **3 script families** (Latin, Cyrillic, Greek).
<p align="center">
<img src="assets/language_distribution.png" alt="Language Distribution" width="800"/>
</p>
## Label Taxonomy (19 Entity Types)
<p align="center">
<img src="assets/label_distribution.png" alt="PII Entity Distribution" width="800"/>
</p>
| Label | Count | % |
|:---|---:|---:|
| `GIVENNAME` | 975 | 12.0% |
| `DATE` | 969 | 12.0% |
| `SURNAME` | 814 | 10.0% |
| `EMAIL` | 565 | 7.0% |
| `CITY` | 562 | 6.9% |
| `TITLE` | 473 | 5.8% |
| `TELEPHONENUM` | 452 | 5.6% |
| `AGE` | 415 | 5.1% |
| `STREET` | 368 | 4.5% |
| `BUILDINGNUM` | 356 | 4.4% |
| `ZIPCODE` | 335 | 4.1% |
| `IDCARDNUM` | 263 | 3.2% |
| `GENDER` | 259 | 3.2% |
| `CREDITCARDNUMBER` | 238 | 2.9% |
| `SEX` | 237 | 2.9% |
| `DRIVERLICENSENUM` | 232 | 2.9% |
| `TAXNUM` | 214 | 2.6% |
| `SOCIALNUM` | 204 | 2.5% |
| `PASSPORTNUM` | 169 | 2.1% |
**Total: 8,100 entities across 1,000 samples (avg 8.1 per sample)**
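A table like the one above can in principle be recomputed from the `privacy_mask` column. The following is a minimal sketch on two made-up records that follow the schema in the card's metadata; `label_distribution` is a hypothetical helper name.

```python
from collections import Counter

def label_distribution(records):
    """Return (label, count, percent) rows, most frequent label first."""
    counts = Counter(
        span["label"] for rec in records for span in rec["privacy_mask"]
    )
    total = sum(counts.values())
    return [(label, n, round(100 * n / total, 1))
            for label, n in counts.most_common()]

# Made-up records; only the fields used here are filled in.
toy_records = [
    {"privacy_mask": [{"label": "GIVENNAME"}, {"label": "SURNAME"},
                      {"label": "EMAIL"}]},
    {"privacy_mask": [{"label": "GIVENNAME"}]},
]
rows = label_distribution(toy_records)
```

The same function should work on the loaded dataset, since iterating a `datasets.Dataset` yields one dict per row.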
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("ai4privacy/openpii-masking-nano-1k", split="train")
print(f"{len(ds)} samples, {len(set(ds['language']))} languages")
```
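Each row pairs `source_text` with a `privacy_mask` list of spans. The sketch below checks that every span's `value` matches the text it points at. It assumes that `start` and `end` are character offsets into `source_text` with an exclusive `end`; the card does not state this explicitly, so verify it on real rows before relying on it. The record shown is invented for illustration.

```python
def misaligned_spans(record):
    """Return spans whose value differs from source_text[start:end].

    Assumes character offsets with an exclusive end (not confirmed by the card).
    """
    text = record["source_text"]
    return [span for span in record["privacy_mask"]
            if text[span["start"]:span["end"]] != span["value"]]

# Invented example record following the documented field names.
example = {
    "source_text": "My name is Anna Keller.",
    "privacy_mask": [
        {"label": "GIVENNAME", "start": 11, "end": 15, "value": "Anna"},
        {"label": "SURNAME", "start": 16, "end": 22, "value": "Keller"},
    ],
}
problems = misaligned_spans(example)
```

Running the check over the loaded dataset, for instance with `[misaligned_spans(row) for row in ds]`, is a quick way to confirm the offset convention.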
## When to use 1K vs 10K
| Use case | 1K | 10K |
|----------|:--:|:---:|
| CI/CD regression tests | x | |
| Quick provider comparison | x | |
| Development iteration | x | |
| Published benchmarks | | x |
| Per-language analysis | | x |
| Statistically robust results | | x |
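For a CI/CD regression test, one possible check is a span-level score compared against a stored threshold. The card does not prescribe a metric, so the exact-match precision, recall, and F1 below are an assumption, and `span_prf` is a hypothetical helper. A span counts as correct only if `start`, `end`, and `label` all match.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold_set = {(s["start"], s["end"], s["label"]) for s in gold}
    pred_set = {(s["start"], s["end"], s["label"]) for s in predicted}
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented gold spans and detector output for one sample.
gold = [{"start": 11, "end": 15, "label": "GIVENNAME"},
        {"start": 16, "end": 22, "label": "SURNAME"}]
pred = [{"start": 11, "end": 15, "label": "GIVENNAME"},
        {"start": 16, "end": 22, "label": "CITY"}]  # wrong label
precision, recall, f1 = span_prf(gold, pred)
```

With 1,000 samples, aggregate scores are more stable than per-language scores, which is consistent with the table reserving per-language analysis for the 10K set.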
## Related Datasets
| Dataset | Size | Use case |
|:---|:---|:---|
| [**OpenPII-Masking-Nano-1K**](https://huggingface.co/datasets/ai4privacy/openpii-masking-nano-1k) | 1K | Fast iteration & CI/CD |
| [**OpenPII-Masking-Mini-10K**](https://huggingface.co/datasets/ai4privacy/openpii-masking-mini-10k) | 10K | Published benchmarks |
| [**OpenPII-Masking-1M**](https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m) | 1.4M | Training & full evaluation |
---
## p5y Data Analytics
This dataset is built on the [p5y](https://p5y.org) framework - think of it as i18n but for privacy. Just as i18n (internationalization) translates content into different locales, p5y translates sensitive data into privacy-safe formats through a standardized 3-step approach:
1. **Awareness** - Scan and markup private entities in unstructured text, producing a structured privacy mask with entity types, distribution, density, and risk assessment.
2. **Protection** - Control identified personal data through masking, pseudonymization, or k-anonymization, tailored to the specific use case and regulatory requirements.
3. **Quality Assurance** - Measure remaining privacy risk after anonymization, evaluating de-anonymization risks through expert annotation and automated assessment.
Learn more at [p5y.org](https://p5y.org)
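To make the Protection step concrete, the sketch below applies simple masking: each detected span is replaced with a placeholder built from its label. The `[LABEL]` placeholder format and the `apply_mask` name are assumptions for illustration and may not match the format used in this dataset's `masked_text` column. Offsets are again assumed to be character positions with an exclusive end.

```python
def apply_mask(text, spans):
    """Replace each span with a [LABEL] placeholder.

    Spans are processed right to left so earlier offsets stay valid
    while the string changes length.
    """
    masked = text
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}]"
        masked = masked[:span["start"]] + placeholder + masked[span["end"]:]
    return masked

# Invented example; spans are assumed not to overlap.
text = "My name is Anna Keller."
spans = [
    {"label": "GIVENNAME", "start": 11, "end": 15},
    {"label": "SURNAME", "start": 16, "end": 22},
]
masked = apply_mask(text, spans)
```

Pseudonymization or k-anonymization, which the list also mentions, would need a different replacement rule and are not shown here.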
---
## About Ai4Privacy
At Ai4Privacy, we are building the global seatbelt for Artificial Intelligence - enabling innovation while safeguarding personal information.
* **Website:** [www.Ai4Privacy.com](https://www.ai4privacy.com)
* **Community:** [Discord](https://discord.gg/kxSbJrUQZF)
---
## Licensing
* **License:** [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). Copyright 2026 Ai Suisse SA.
* **Attribution:** Credit "Ai4Privacy / Ai Suisse SA" and link to this repository.
* This dataset contains **synthetic PII only** - no real personal data.
```bibtex
@dataset{ai4privacy_openpii_nano_1k_2026,
author = {Ai4Privacy},
title = {OpenPII-Masking-Nano-1K - Multilingual PII Detection Benchmark},
year = 2026,
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ai4privacy/openpii-masking-nano-1k}
}
```
Ai4Privacy is a project affiliated with [Ai Suisse SA](https://www.aisuisse.com/).
---
Provider: ai4privacy



