Ganasekhar/pii-masking-400k

Name: Ganasekhar/pii-masking-400k
Creator: Ganasekhar
Published: 2026-03-10 11:03:11
License: No description available

Hugging Face | Updated 2026-03-10 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/Ganasekhar/pii-masking-400k

Dataset description:

---
license: other
license_name: license.md
language:
- en
- fr
- de
- it
- es
- nl
task_categories:
- text-classification
- token-classification
- table-question-answering
- question-answering
- zero-shot-classification
- summarization
- feature-extraction
- text-generation
- text2text-generation
- translation
- fill-mask
- tabular-classification
- tabular-to-text
- table-to-text
- text-retrieval
- other
multilinguality:
- multilingual
tags:
- legal
- business
- psychology
- privacy
- gdpr
- euaiact
- aiact
- pii
- sensitive
size_categories:
- 100K<n<1M
pretty_name: Ai4Privacy PII 300k Dataset
source_datasets:
- original
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train/*.jsonl"
  - split: validation
    path: "data/validation/*.jsonl"
---

# Purpose and Features

🌍 World's largest open dataset for privacy masking 🌎

The dataset is useful to train and evaluate models to remove personally identifiable and sensitive information from text, especially in the context of AI assistants and LLMs.

# AI4Privacy Dataset Analytics 📊

## Dataset Overview
- **Total entries:** 406,896
- **Total tokens:** 20,564,179
- **Total PII tokens:** 2,357,029
- **Number of PII classes in public dataset:** 17
- **Number of PII classes in extended dataset:** 63

## Language Distribution 🌍
- English (en): 85,321
- Italian (it): 81,007
- French (fr): 80,249
- German (de): 79,880
- Dutch (nl): 38,916
- Spanish (es): 41,523

## Locale Distribution 🌎
- United Kingdom (GB) 🇬🇧: 41,853
- United States (US) 🇺🇸: 43,468
- Italy (IT) 🇮🇹: 40,629
- France (FR) 🇫🇷: 40,026
- Switzerland (CH) 🇨🇭: 119,440
- Netherlands (NL) 🇳🇱: 38,916
- Germany (DE) 🇩🇪: 41,041
- Spain (ES) 🇪🇸: 41,523

## Dataset Split
- Train: 325,517 (80.00%)
- Validation: 81,379 (20.00%)

## Key Facts 🔑
- This is synthetic data! Generated using proprietary algorithms - no privacy violations! 🛡️
- 6 languages in total with strong localisation in 8 jurisdictions.
- The extended dataset includes a total of 63 PII classes, providing even more comprehensive coverage of sensitive information.

For more information about the extended dataset or to discuss partnership opportunities, please contact us at partnerships@ai4privacy.com 📧

# Getting started

Option 1: Python

```terminal
pip install datasets
```

```python
from datasets import load_dataset
dataset = load_dataset("ai4privacy/pii-masking-400k")
```

# Text entry lengths and PII distributions

This is the 4th iteration of the pii-masking dataset series, and this release further improves the average text entry length.

The current distribution of sensitive data and PII tokens:

![PII Type Distribution](pii_type_distribution_pii_400k.png)

# Compatible Machine Learning Tasks:
- Token classification. Check out Hugging Face's [guide on token classification](https://huggingface.co/docs/transformers/tasks/token_classification).
  - [ALBERT](https://huggingface.co/docs/transformers/model_doc/albert), [BERT](https://huggingface.co/docs/transformers/model_doc/bert), [BigBird](https://huggingface.co/docs/transformers/model_doc/big_bird), [BioGpt](https://huggingface.co/docs/transformers/model_doc/biogpt), [BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom), [BROS](https://huggingface.co/docs/transformers/model_doc/bros), [CamemBERT](https://huggingface.co/docs/transformers/model_doc/camembert), [CANINE](https://huggingface.co/docs/transformers/model_doc/canine), [ConvBERT](https://huggingface.co/docs/transformers/model_doc/convbert), [Data2VecText](https://huggingface.co/docs/transformers/model_doc/data2vec-text), [DeBERTa](https://huggingface.co/docs/transformers/model_doc/deberta), [DeBERTa-v2](https://huggingface.co/docs/transformers/model_doc/deberta-v2), [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert), [ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra), [ERNIE](https://huggingface.co/docs/transformers/model_doc/ernie), [ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m), [ESM](https://huggingface.co/docs/transformers/model_doc/esm), [Falcon](https://huggingface.co/docs/transformers/model_doc/falcon), [FlauBERT](https://huggingface.co/docs/transformers/model_doc/flaubert), [FNet](https://huggingface.co/docs/transformers/model_doc/fnet), [Funnel Transformer](https://huggingface.co/docs/transformers/model_doc/funnel), [GPT-Sw3](https://huggingface.co/docs/transformers/model_doc/gpt-sw3), [OpenAI GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2), [GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode), [GPT Neo](https://huggingface.co/docs/transformers/model_doc/gpt_neo), [GPT NeoX](https://huggingface.co/docs/transformers/model_doc/gpt_neox), [I-BERT](https://huggingface.co/docs/transformers/model_doc/ibert), [LayoutLM](https://huggingface.co/docs/transformers/model_doc/layoutlm), [LayoutLMv2](https://huggingface.co/docs/transformers/model_doc/layoutlmv2), [LayoutLMv3](https://huggingface.co/docs/transformers/model_doc/layoutlmv3), [LiLT](https://huggingface.co/docs/transformers/model_doc/lilt), [Longformer](https://huggingface.co/docs/transformers/model_doc/longformer), [LUKE](https://huggingface.co/docs/transformers/model_doc/luke), [MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm), [MEGA](https://huggingface.co/docs/transformers/model_doc/mega), [Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert), [MobileBERT](https://huggingface.co/docs/transformers/model_doc/mobilebert), [MPNet](https://huggingface.co/docs/transformers/model_doc/mpnet), [MPT](https://huggingface.co/docs/transformers/model_doc/mpt), [MRA](https://huggingface.co/docs/transformers/model_doc/mra), [Nezha](https://huggingface.co/docs/transformers/model_doc/nezha), [Nyströmformer](https://huggingface.co/docs/transformers/model_doc/nystromformer), [QDQBert](https://huggingface.co/docs/transformers/model_doc/qdqbert), [RemBERT](https://huggingface.co/docs/transformers/model_doc/rembert), [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta), [RoBERTa-PreLayerNorm](https://huggingface.co/docs/transformers/model_doc/roberta-prelayernorm), [RoCBert](https://huggingface.co/docs/transformers/model_doc/roc_bert), [RoFormer](https://huggingface.co/docs/transformers/model_doc/roformer), [SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert), [XLM](https://huggingface.co/docs/transformers/model_doc/xlm), [XLM-RoBERTa](https://huggingface.co/docs/transformers/model_doc/xlm-roberta), [XLM-RoBERTa-XL](https://huggingface.co/docs/transformers/model_doc/xlm-roberta-xl), [XLNet](https://huggingface.co/docs/transformers/model_doc/xlnet), [X-MOD](https://huggingface.co/docs/transformers/model_doc/xmod), [YOSO](https://huggingface.co/docs/transformers/model_doc/yoso)
- Text Generation: Mapping the unmasked text ("source_text") to the masked text ("target_text") or to the "privacy_mask" attribute. Check out Hugging Face's [guide to fine-tuning](https://huggingface.co/docs/transformers/v4.15.0/training).
  - [T5 Family](https://huggingface.co/docs/transformers/model_doc/t5), [Llama2](https://huggingface.co/docs/transformers/main/model_doc/llama2)

# Information regarding the rows:
- Each row represents a JSON object with a natural language text that includes placeholders for PII.
- Sample row:
  - "source_text" shows a natural text containing PII
    - "Subject: Group Messaging for Admissions Process\n\nGood morning, everyone,\n\nI hope this message finds you well. As we continue our admissions processes, I would like to update you on the latest developments and key information. Please find below the timeline for our upcoming meetings:\n\n- wynqvrh053 - Meeting at 10:20am\n- luka.burg - Meeting at 21\n- qahil.wittauer - Meeting at quarter past 13\n- gholamhossein.ruschke - Meeting at 9:47 PM\n- pdmjrsyoz1460 "
  - "target_text" contains a masked version of the source text
    - "Subject: Group Messaging for Admissions Process\n\nGood morning, everyone,\n\nI hope this message finds you well. As we continue our admissions processes, I would like to update you on the latest developments and key information. 
      Please find below the timeline for our upcoming meetings:\n\n- [USERNAME] - Meeting at [TIME]\n- [USERNAME] - Meeting at [TIME]\n- [USERNAME] - Meeting at [TIME]\n- [USERNAME] - Meeting at [TIME]\n- [USERNAME] "
  - "privacy_mask" contains the privacy mask labels in an explicit format
    - [{"value": "wynqvrh053", "start": 287, "end": 297, "label": "USERNAME"}, {"value": "10:20am", "start": 311, "end": 318, "label": "TIME"}, {"value": "luka.burg", "start": 321, "end": 330, "label": "USERNAME"}, {"value": "21", "start": 344, "end": 346, "label": "TIME"}, {"value": "qahil.wittauer", "start": 349, "end": 363, "label": "USERNAME"}, {"value": "quarter past 13", "start": 377, "end": 392, "label": "TIME"}, {"value": "gholamhossein.ruschke", "start": 395, "end": 416, "label": "USERNAME"}, {"value": "9:47 PM", "start": 430, "end": 437, "label": "TIME"}, {"value": "pdmjrsyoz1460", "start": 440, "end": 453, "label": "USERNAME"}],
  - "span_labels" displays the exact mapping spans of the private information within the text
    - [[440, 453, "USERNAME"], [430, 437, "TIME"], [395, 416, "USERNAME"], [377, 392, "TIME"], [349, 363, "USERNAME"], [344, 346, "TIME"], [321, 330, "USERNAME"], [311, 318, "TIME"], [287, 297, "USERNAME"]],
  - "mberttokens" indicates the breakdown of the text into tokens associated with multilingual BERT
    - ["Sub", "##ject", ":", "Group", "Mess", "##aging", "for", "Ad", "##mission", "##s", "Process", "Good", "morning", ",", "everyone", ",", "I", "hope", "this", "message", "finds", "you", "well", ".", "As", "we", "continue", "our", "admission", "##s", "processes", ",", "I", "would", "like", "to", "update", "you", "on", "the", "latest", "developments", "and", "key", "information", ".", "Please", "find", "below", "the", "time", "##line", "for", "our", "upcoming", "meetings", ":", "-", "w", "##yn", "##q", "##vr", "##h", "##0", "##53", "-", "Meeting", "at", "10", ":", "20", "##am", "-", "luka", ".", "bu", "##rg", "-", "Meeting", "at", "21", "-", "q", "##ahi", "##l", ".", "wit", "##tau", "##er", "-", "Meeting", "at", "quarter", "past", "13", "-", "gh", "##ola", "##mh", "##osse", "##in", ".", "rus", "##ch", "##ke", "-", "Meeting", "at", "9", ":", "47", "PM", "-", "p", "##d", "##m", "##jr", "##sy", "##oz", "##14", "##60"]
  - mbert_bio_labels demonstrates the labels associated with the BIO labelling task in Machine Learning using the mbert tokens.
    - ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O",
      "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",
      "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",
      "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",
      "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",
      "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",
      "O", "O", "O", "O", "O", "O", "O", "O", "O", "O",
      "O", "O", "O", "O", "O", "O",
      "B-USERNAME", "I-USERNAME", "I-USERNAME", "O", "O", "O", "O", "O", "O", "O",
      "B-TIME", "I-TIME", "I-TIME", "O", "B-USERNAME", "I-USERNAME", "O", "O", "O",
      "B-TIME", "I-TIME", "I-USERNAME", "I-USERNAME", "I-USERNAME", "I-USERNAME", "I-USERNAME", "I-USERNAME", "I-USERNAME",
      "O", "O", "O", "O",
      "B-TIME", "I-TIME", "I-TIME", "I-TIME", "I-TIME", "I-TIME", "I-TIME", "I-TIME", "I-TIME", "I-TIME",
      "O", "B-USERNAME", "I-USERNAME"],
  - "id": indicates the ID of the entry for future reference and feedback
    - "40767A"
  - "language": language of the entry
    - "en"
  - "locale": locale associated with the data
  - "split": type of the machine learning set
    - "train" or "validation"

*Note: nested objects are stored as strings to maximise compatibility between various software.

# About Us:

At Ai4Privacy, we are committed to building the global seatbelt of the 21st century for Artificial Intelligence to help fight against potential risks of personal information being integrated into data pipelines.

Newsletter & updates: [www.Ai4Privacy.com](www.Ai4Privacy.com)
- Looking for ML engineers, developers, beta-testers, human in the loop validators (all languages)
- Integrations with already existing open solutions
- Ask us a question on Discord: [https://discord.gg/kxSbJrUQZF](https://discord.gg/kxSbJrUQZF)

# Roadmap and Future Development
- Carbon neutral
- Additional benchmarking methods for NER
- Better multilingual support, especially localisation
- Continuously increase the training and testing sets

# Known Issues
- Odd usage of PII, which will be resolved in the next release

# Use Cases and Applications

**Chatbots**: Incorporating a PII masking model into chatbot systems can ensure the privacy and security of user conversations by automatically redacting sensitive information such as names, addresses, phone numbers, and email addresses.

**Customer Support Systems**: When interacting with customers through support tickets or live chats, masking PII can help protect sensitive customer data, enabling support agents to handle inquiries without the risk of exposing personal information.

**Email Filtering**: Email providers can utilize a PII masking model to automatically detect and redact PII from incoming and outgoing emails, reducing the chances of accidental disclosure of sensitive information.

**Data Anonymization**: Organizations dealing with large datasets containing PII, such as medical or financial records, can leverage a PII masking model to anonymize the data before sharing it for research, analysis, or collaboration purposes.

**Social Media Platforms**: Integrating PII masking capabilities into social media platforms can help users protect their personal information from unauthorized access, ensuring a safer online environment.

**Content Moderation**: PII masking can assist content moderation systems in automatically detecting and blurring or redacting sensitive information in user-generated content, preventing the accidental sharing of personal details.

**Online Forms**: Web applications that collect user data through online forms, such as registration forms or surveys, can employ a PII masking model to anonymize or mask the collected information in real-time, enhancing privacy and data protection.

**Collaborative Document Editing**: Collaboration platforms and document editing tools can use a PII masking model to automatically mask or redact sensitive information when multiple users are working on shared documents.

**Research and Data Sharing**: Researchers and institutions can leverage a PII masking model to ensure privacy and confidentiality when sharing datasets for collaboration, analysis, or publication purposes, reducing the risk of data breaches or identity theft.

**Content Generation**: Content generation systems, such as article generators or language models, can benefit from PII masking to automatically mask or generate fictional PII when creating sample texts or examples, safeguarding the privacy of individuals.

(...and whatever else your creative mind can think of)

# Licensing

Academic use is encouraged with proper citation provided it follows similar license terms\*. Commercial entities should contact us at licensing@ai4privacy.com for licensing inquiries and additional data access.\*

\* Terms apply. See [LICENSE.md](LICENSE.md) for full details.

# Support and Maintenance

AI4Privacy is a project affiliated with [Ai Suisse SA](https://www.aisuisse.com/).
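
Returning to the row format described under "Information regarding the rows": the sketch below illustrates how the "privacy_mask" spans relate "source_text" to "target_text". It is a minimal illustration, not the authors' tooling. It assumes that "privacy_mask" arrives as a JSON-encoded string (the note on nested objects suggests strings, but the exact encoding is an assumption) and that "end" offsets are exclusive, which matches the sample spans. The helper names `parse_privacy_mask` and `apply_mask` and the miniature row are hypothetical.

```python
import json


def parse_privacy_mask(raw):
    """Return the privacy mask as a list of dicts.

    Assumption: nested objects may be stored as JSON strings, so decode
    them when needed and pass already-decoded lists through unchanged.
    """
    return json.loads(raw) if isinstance(raw, str) else raw


def apply_mask(source_text, privacy_mask):
    """Replace every labelled span in source_text with a [LABEL] placeholder."""
    masked = source_text
    # Process spans from last to first so earlier offsets remain valid.
    for span in sorted(privacy_mask, key=lambda s: s["start"], reverse=True):
        start, end = span["start"], span["end"]
        # Sanity check: the offsets should point at the recorded value.
        assert source_text[start:end] == span["value"]
        masked = masked[:start] + f"[{span['label']}]" + masked[end:]
    return masked


# Hypothetical miniature row modelled on the sample row; not taken from the dataset.
row = {
    "source_text": "- luka.burg - Meeting at 21",
    "privacy_mask": json.dumps([
        {"value": "luka.burg", "start": 2, "end": 11, "label": "USERNAME"},
        {"value": "21", "start": 25, "end": 27, "label": "TIME"},
    ]),
}

mask = parse_privacy_mask(row["privacy_mask"])
masked_text = apply_mask(row["source_text"], mask)
print(masked_text)  # - [USERNAME] - Meeting at [TIME]
```

Applying the spans from right to left avoids recomputing offsets after each replacement, since a placeholder such as `[USERNAME]` usually differs in length from the value it replaces.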

