ai4privacy/pii-masking-400k

Name: ai4privacy/pii-masking-400k
Creator: ai4privacy
Published: 2025-02-23 19:00:11
License: No description available

Source: Hugging Face. Updated 2025-02-23; indexed 2025-04-12.

Download link:

https://hf-mirror.com/datasets/ai4privacy/pii-masking-400k

Overview:

The Ai4Privacy PII 300k Dataset is a synthetic dataset for training and evaluating privacy-masking models, that is, models that remove personally identifiable information (PII) and other sensitive information from text. It covers multiple languages and jurisdictions and is intended for NLP tasks such as text classification, token classification, and text generation. It also suits applications such as chatbots, customer support systems, email filtering, and content moderation. The dataset is split into training and validation sets and is available in languages including English, Italian, French, German, Dutch, and Spanish, among others. Because the data is synthetic, generated with proprietary algorithms, it is designed to avoid privacy violations. The extended dataset contains 63 PII classes in total for broad coverage of sensitive information. The dataset is compatible with a range of machine learning models and can be loaded with the Hugging Face datasets library.
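As a rough illustration of loading the data and of what privacy masking means in practice, here is a minimal Python sketch. It makes several assumptions: the split name "train" is inferred from the description above, the helper names `load_split` and `mask_text` are hypothetical, and the `(start, end, label)` span layout and the label names are illustrative only and may not match the dataset's actual columns. Check the dataset card before relying on any of them.

```python
from typing import Iterable, Tuple

REPO_ID = "ai4privacy/pii-masking-400k"  # dataset ID taken from this listing


def load_split(split: str = "train"):
    """Load one split with the Hugging Face `datasets` library.

    Needs network access and `pip install datasets`. The split name is an
    assumption based on the "training and validation sets" wording above.
    """
    from datasets import load_dataset  # lazy import: the rest runs without it
    return load_dataset(REPO_ID, split=split)


# Hypothetical annotation layout: (start offset, end offset, PII label).
Span = Tuple[int, int, str]


def mask_text(text: str, spans: Iterable[Span]) -> str:
    """Replace each annotated character span with a `[LABEL]` placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:start])  # keep the text before the span
        pieces.append(f"[{label}]")        # substitute the sensitive span
        cursor = end
    pieces.append(text[cursor:])           # keep the remaining tail
    return "".join(pieces)
```

Calling `mask_text` with spans that cover a name and an email address would return the sentence with those spans replaced by their bracketed labels. A trained token-classification model would normally supply the spans; here they are passed in by hand.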

Provided by:

ai4privacy
