ai4privacy/openpii-masking-nano-1k

Hugging Face, updated 2026-04-04, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/ai4privacy/openpii-masking-nano-1k
Resource description:
---
license: cc-by-4.0
language:
- en
- fr
- de
- es
- it
- nl
- bg
- cs
- da
- el
- et
- fi
- hr
- hu
- lt
- lv
- pl
- pt
- ro
- sk
- sl
- sr
- sv
task_categories:
- token-classification
tags:
- privacy
- pii
- sensitive-data
- data-masking
- data-anonymization
- ner
- synthetic
- multilingual
- ai4privacy
- openpii
- benchmark
- evaluation
- gdpr
pretty_name: OpenPII-Masking-Nano-1K - Multilingual PII Detection Benchmark
size_categories:
- 1K<n<10K
source_datasets:
- ai4privacy/openpii-masking-mini-10k
- ai4privacy/pii-masking-openpii-1m
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: source_text
    dtype: string
  - name: masked_text
    dtype: string
  - name: privacy_mask
    list:
    - name: end
      dtype: int64
    - name: label
      dtype: string
    - name: label_index
      dtype: int64
    - name: start
      dtype: int64
    - name: value
      dtype: string
  - name: split
    dtype: string
  - name: uid
    dtype: int64
  - name: language
    dtype: string
  - name: region
    dtype: string
  - name: script
    dtype: string
  - name: mbert_tokens
    list: string
  - name: mbert_token_classes
    list: string
  splits:
  - name: train
    num_bytes: 3347584
    num_examples: 1000
  download_size: 1231736
  dataset_size: 3347584
---

# OpenPII-Masking-Nano-1K - The Fast PII Detection Benchmark

<p align="center">
  <img src="assets/europe_language_map.png" alt="Geographic Coverage - 29 Regions, 23 Languages" width="480"/>
</p>

## A compact, provider-agnostic benchmark for quick PII detection evaluation.

The little sibling of [OpenPII-Masking-Mini-10K](https://huggingface.co/datasets/ai4privacy/openpii-masking-mini-10k). Same methodology, same 23 languages, same 19 entity types - just 1K samples for rapid iteration, CI/CD pipelines, and quick provider comparisons.
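The `privacy_mask` feature declared in the metadata above pairs each annotated entity with `start`, `end`, `label`, `label_index`, and `value` fields. The following is a minimal sketch of how such a mask could be applied to `source_text`. It assumes that `start` and `end` are character offsets with an exclusive end, that spans do not overlap, and that a `[LABEL_index]` placeholder is acceptable; none of these details are confirmed by this card. The function name `apply_privacy_mask` and the sample record are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List


def apply_privacy_mask(source_text: str, privacy_mask: List[Dict]) -> str:
    """Replace each annotated span with a [LABEL_index] placeholder.

    Assumptions (not confirmed by the dataset card): start/end are character
    offsets into source_text, end is exclusive, and spans do not overlap.
    """
    masked = source_text
    # Process spans right to left so earlier offsets stay valid after each edit.
    for span in sorted(privacy_mask, key=lambda s: s["start"], reverse=True):
        # Sanity check: the stored value should match the text at the offsets.
        if source_text[span["start"]:span["end"]] != span["value"]:
            raise ValueError(f"offsets do not match value for span {span}")
        placeholder = f"[{span['label']}_{span['label_index']}]"
        masked = masked[:span["start"]] + placeholder + masked[span["end"]:]
    return masked


# Hypothetical record shaped like the declared schema (synthetic, illustrative only).
record = {
    "source_text": "Contact Anna Keller at anna.keller@example.com.",
    "privacy_mask": [
        {"start": 8, "end": 12, "label": "GIVENNAME", "label_index": 1,
         "value": "Anna"},
        {"start": 13, "end": 19, "label": "SURNAME", "label_index": 1,
         "value": "Keller"},
        {"start": 23, "end": 46, "label": "EMAIL", "label_index": 1,
         "value": "anna.keller@example.com"},
    ],
}

masked = apply_privacy_mask(record["source_text"], record["privacy_mask"])
print(masked)
```

Replacing from the rightmost span first avoids recomputing offsets after each substitution. If the real offsets turn out to be inclusive or byte-based, the slicing would need to change accordingly.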
| Property | Value |
|:---|:---|
| **Total Samples** | 1,000 |
| **Languages** | 23 |
| **Regions** | 29 |
| **Scripts** | 3 (Latin, Cyrillic, Greek) |
| **Entity Types** | 19 |
| **Total Annotations** | 8,100 |
| **Avg Entities / Sample** | 8.1 |
| **PII Coverage** | 100% (every sample contains PII) |
| **Source** | [OpenPII-Masking-Mini-10K](https://huggingface.co/datasets/ai4privacy/openpii-masking-mini-10k) / [OpenPII-Masking-1M](https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m) |
| **Sampling** | Stratified by language, seed=42 |
| **License** | CC-BY-4.0 |

## Geographic & Language Coverage

The benchmark spans **29 regions** - 25 European countries plus Canada (CA), United States (US), Mexico (MX), and India (IN) - covering **3 script families** (Latin, Cyrillic, Greek).

<p align="center">
  <img src="assets/language_distribution.png" alt="Language Distribution" width="800"/>
</p>

## Label Taxonomy (19 Entity Types)

<p align="center">
  <img src="assets/label_distribution.png" alt="PII Entity Distribution" width="800"/>
</p>

| Label | Count | % |
|:---|---:|---:|
| `GIVENNAME` | 975 | 12.0% |
| `DATE` | 969 | 12.0% |
| `SURNAME` | 814 | 10.0% |
| `EMAIL` | 565 | 7.0% |
| `CITY` | 562 | 6.9% |
| `TITLE` | 473 | 5.8% |
| `TELEPHONENUM` | 452 | 5.6% |
| `AGE` | 415 | 5.1% |
| `STREET` | 368 | 4.5% |
| `BUILDINGNUM` | 356 | 4.4% |
| `ZIPCODE` | 335 | 4.1% |
| `IDCARDNUM` | 263 | 3.2% |
| `GENDER` | 259 | 3.2% |
| `CREDITCARDNUMBER` | 238 | 2.9% |
| `SEX` | 237 | 2.9% |
| `DRIVERLICENSENUM` | 232 | 2.9% |
| `TAXNUM` | 214 | 2.6% |
| `SOCIALNUM` | 204 | 2.5% |
| `PASSPORTNUM` | 169 | 2.1% |

**Total: 8,100 entities across 1,000 samples (avg 8.1 per sample)**

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("ai4privacy/openpii-masking-nano-1k", split="train")
print(f"{len(ds)} samples, {len(set(ds['language']))} languages")
```

## When to use 1K vs 10K

| Use case | 1K | 10K |
|----------|:--:|:---:|
| CI/CD regression tests | x | |
| Quick provider comparison | x | |
| Development iteration | x | |
| Published benchmarks | | x |
| Per-language analysis | | x |
| Statistically robust results | | x |

## Related Datasets

| Dataset | Size | Use case |
|:---|:---|:---|
| [**OpenPII-Masking-Nano-1K**](https://huggingface.co/datasets/ai4privacy/openpii-masking-nano-1k) | 1K | Fast iteration & CI/CD |
| [**OpenPII-Masking-Mini-10K**](https://huggingface.co/datasets/ai4privacy/openpii-masking-mini-10k) | 10K | Published benchmarks |
| [**OpenPII-Masking-1M**](https://huggingface.co/datasets/ai4privacy/pii-masking-openpii-1m) | 1.4M | Training & full evaluation |

---

## p5y Data Analytics

This dataset is built on the [p5y](https://p5y.org) framework - think of it as i18n but for privacy. Just as i18n (internationalization) translates content into different locales, p5y translates sensitive data into privacy-safe formats through a standardized 3-step approach:

1. **Awareness** - Scan and markup private entities in unstructured text, producing a structured privacy mask with entity types, distribution, density, and risk assessment.
2. **Protection** - Control identified personal data through masking, pseudonymization, or k-anonymization, tailored to the specific use case and regulatory requirements.
3. **Quality Assurance** - Measure remaining privacy risk after anonymization, evaluating de-anonymization risks through expert annotation and automated assessment.

Learn more at [p5y.org](https://p5y.org)

---

## About Ai4Privacy

At Ai4Privacy, we are building the global seatbelt for Artificial Intelligence - enabling innovation while safeguarding personal information.

* **Website:** [www.Ai4Privacy.com](https://www.ai4privacy.com)
* **Community:** [Discord](https://discord.gg/kxSbJrUQZF)

---

## Licensing

* **License:** [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/). Copyright 2026 Ai Suisse SA.
* **Attribution:** Credit "Ai4Privacy / Ai Suisse SA" and link to this repository.
* This dataset contains **synthetic PII only** - no real personal data.

```bibtex
@dataset{ai4privacy_openpii_nano_1k_2026,
  author    = {Ai4Privacy},
  title     = {OpenPII-Masking-Nano-1K - Multilingual PII Detection Benchmark},
  year      = 2026,
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/ai4privacy/openpii-masking-nano-1k}
}
```

Ai4Privacy is a project affiliated with [Ai Suisse SA](https://www.aisuisse.com/).
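For the quick provider comparisons this card mentions, some scoring rule is needed, and the card does not prescribe one. The following is a minimal sketch, assuming exact-match span scoring: a prediction counts as correct only if its `start`, `end`, and `label` all equal a gold `privacy_mask` entry, and scores are micro-averaged over samples. The helper names `to_span_set` and `span_prf` and the sample spans are hypothetical and are not part of any Ai4Privacy tooling.

```python
from typing import Dict, Iterable, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def to_span_set(mask: Iterable[Dict]) -> Set[Span]:
    """Reduce privacy_mask-style dicts to comparable (start, end, label) tuples."""
    return {(m["start"], m["end"], m["label"]) for m in mask}


def span_prf(gold_masks: List[List[Dict]],
             pred_masks: List[List[Dict]]) -> Dict[str, float]:
    """Micro-averaged exact-match precision, recall, and F1 over all samples."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_masks, pred_masks):
        g, p = to_span_set(gold), to_span_set(pred)
        tp += len(g & p)   # predicted spans that match gold exactly
        fp += len(p - g)   # predicted spans with no exact gold match
        fn += len(g - p)   # gold spans the provider missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical gold and predicted spans for a single sample.
gold = [[
    {"start": 8, "end": 12, "label": "GIVENNAME"},
    {"start": 13, "end": 19, "label": "SURNAME"},
    {"start": 23, "end": 46, "label": "EMAIL"},
]]
pred = [[
    {"start": 8, "end": 12, "label": "GIVENNAME"},
    {"start": 13, "end": 19, "label": "CITY"},     # right span, wrong label
    {"start": 23, "end": 46, "label": "EMAIL"},
]]

scores = span_prf(gold, pred)
print(scores)
```

Exact matching is strict: a span that is off by one character or carries the wrong label counts as both a false positive and a false negative. Overlap-based or token-level scoring (for example over `mbert_token_classes`) would be a more lenient alternative.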

Provider:
ai4privacy