open-pii-masking-500k-ai4privacy

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://doi.org/10.7910/DVN/4H11OA
Description:
# 🌍 World's largest open dataset for privacy masking 🌎

The dataset is useful for training and evaluating models that remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs.

![Task Showcase of Privacy Masking](assets/p5y_translation_example.png)

# Dataset Analytics 📊 - ai4privacy/open-pii-masking-500k-ai4privacy

## p5y Data Analytics

- **Total Entries**: 580,227
- **Total Tokens**: 19,199,982
- **Average Source Text Length**: 17.37 words
- **Total PII Labels**: 5,705,973
- **Number of Unique PII Classes**: 20 (Open PII Labelset)
- **Unique Identity Values**: 704,215

---

## Language Distribution Analytics

**Number of Unique Languages**: 8

| Language | Count | Percentage |
|--------------------|----------|------------|
| English (en) 🇺🇸🇬🇧🇨🇦🇮🇳 | 150,693 | 25.97% |
| French (fr) 🇫🇷🇨🇭🇨🇦 | 112,136 | 19.33% |
| German (de) 🇩🇪🇨🇭 | 82,384 | 14.20% |
| Spanish (es) 🇪🇸 🇲🇽 | 78,013 | 13.45% |
| Italian (it) 🇮🇹🇨🇭 | 68,824 | 11.86% |
| Dutch (nl) 🇳🇱 | 26,628 | 4.59% |
| Hindi (hi)* 🇮🇳 | 33,963 | 5.85% |
| Telugu (te)* 🇮🇳 | 27,586 | 4.75% |

\* These languages are in experimental stages.

---

## Region Distribution Analytics

**Number of Unique Regions**: 11

| Region | Count | Percentage |
|-----------------------|----------|------------|
| Switzerland (CH) 🇨🇭 | 112,531 | 19.39% |
| India (IN) 🇮🇳 | 99,724 | 17.19% |
| Canada (CA) 🇨🇦 | 74,733 | 12.88% |
| Germany (DE) 🇩🇪 | 41,604 | 7.17% |
| Spain (ES) 🇪🇸 | 39,557 | 6.82% |
| Mexico (MX) 🇲🇽 | 38,456 | 6.63% |
| France (FR) 🇫🇷 | 37,886 | 6.53% |
| Great Britain (GB) 🇬🇧 | 37,092 | 6.39% |
| United States (US) 🇺🇸 | 37,008 | 6.38% |
| Italy (IT) 🇮🇹 | 35,008 | 6.03% |
| Netherlands (NL) 🇳🇱 | 26,628 | 4.59% |

---

## Machine Learning Task Analytics

| Split | Count | Percentage |
|-------------|----------|------------|
| **Train** | 464,150 | 79.99% |
| **Validate**| 116,077 | 20.01% |

---

# Usage

Option 1: Python

```terminal
pip install datasets
```

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
dataset = load_dataset("ai4privacy/open-pii-masking-500k-ai4privacy")
```

# Compatible Machine Learning Tasks:

- Token classification. See Hugging Face's [guide on token classification](https://huggingface.co/docs/transformers/tasks/token_classification).
  - [ALBERT](https://huggingface.co/docs/transformers/model_doc/albert), [BERT](https://huggingface.co/docs/transformers/model_doc/bert), [BigBird](https://huggingface.co/docs/transformers/model_doc/big_bird), [BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt), [BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom), [BROS](https://huggingface.co/docs/transformers/model_doc/bros), [CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert), [CANINE](https://huggingface.co/docs/transformers/model_doc/canine), [ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert), [Data2VecText](https://huggingface.co/docs/transformers/model_doc/data2vec-text), [DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta), [DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2), [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert), [ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra), [ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie), [ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m), [ESM](https://huggingface.co/docs/transformers/model_doc/esm), [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon), [FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert), [FNet](https://huggingface.co/docs/transformers/model_doc/fnet), [Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel), [GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3), [OpenAI GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2), [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode), [GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo), [GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox), [I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert), [LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm), [LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2), [LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3), [LiLT](https://huggingface.co/docs/transformers/model_doc/lilt), [Longformer](https://huggingface.co/docs/transformers/model_doc/longformer), [LUKE](https://huggingface.co/docs/transformers/model_doc/luke), [MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm), [MEGA](https://huggingface.co/docs/transformers/model_doc/mega), [Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert), [MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert),...
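
To make the privacy-masking task concrete, the following is a minimal sketch of how labelled character spans could be turned into masked text. It does not load the dataset and does not reflect the dataset's actual schema: the span keys `start`, `end`, and `label`, the helper `mask_text`, and the label names `GIVENNAME` and `EMAIL` are assumptions chosen for illustration and should be checked against the real records.

```python
from typing import Dict, List


def mask_text(text: str, spans: List[Dict]) -> str:
    """Replace each labelled character span with a [LABEL] placeholder.

    `spans` is a hypothetical structure: dicts with `start`, `end`
    (character offsets, end exclusive) and `label`.
    """
    pieces = []
    cursor = 0
    # Process spans left to right so offsets into the original text stay valid.
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["start"] < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:span["start"]])  # untouched text before the span
        pieces.append(f"[{span['label']}]")        # placeholder for the PII value
        cursor = span["end"]
    pieces.append(text[cursor:])                   # remaining text after the last span
    return "".join(pieces)


# Hypothetical example record (not taken from the dataset).
example_text = "Contact Maria at maria@example.com today."
example_spans = [
    {"start": 8, "end": 13, "label": "GIVENNAME"},
    {"start": 17, "end": 34, "label": "EMAIL"},
]

masked = mask_text(example_text, example_spans)
print(masked)
```

A token classification model trained on the dataset would be expected to predict such spans; the replacement step itself is plain string slicing, which is why the sketch sorts spans and rejects overlaps rather than trying to merge them.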
Created:
2025-03-17