PII-NER
收藏魔搭社区2025-12-05 更新2025-09-06 收录
下载链接:
https://modelscope.cn/datasets/Josephgflowers/PII-NER
下载链接
链接失效反馈官方服务:
资源简介:
Dataset Card for NER PII Extraction Dataset
Dataset Summary
This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.
Supported Tasks and Leaderboards
Named Entity Recognition (NER): Identifying and classifying PII entities in text.
Data Anonymization: Masking or removing PII from textual data.
Compliance Monitoring: Ensuring that documents comply with privacy regulations by detecting PII.
Dataset Structure
Data Instances
Each data instance consists of three parts:
System Instruction: Guidelines provided to the assistant for extracting PII.
User Input: The textual content containing PII.
Assistant Response: The extracted PII entities in JSON format.
Example
{
"system": "Extract the following PII entities from the provided text and provide:\n- Extract each entity exactly as mentioned in the text.\n- Do not create new labels; return each entity under its corresponding label.\n- Provide a list of entities for each label. If no entities are found for a label, return an empty list.\n- Accuracy and relevance in your responses are crucial.\nLabels and their Descriptions:\n- AADHAR_ID: Extract Aadhar ID numbers.\n- NAME_STUDENT: Extract student names.\n- EMAIL: Extract email addresses.\n- USERNAME: Extract usernames.\n- PHONE_NUM: Extract phone numbers.\n- URL_PERSONAL: Extract personal URLs.\n- STREET_ADDRESS: Extract street addresses.\n- PAN_NUMBER: Extract PAN numbers.\n- DRIVING_LICENSE: Extract driving license numbers.\n- PASSPORT: Extract passport numbers.\nProvide your response in JSON format.",
"user": "As an aspiring writer, I, [Name], residing at [Address], am eager to express my thoughts and ideas through the written word. My digital footprint, accessible through my personal URL [Personal URL] and username [Username], provides a glimpse into my online presence. My personal identification details, including my PAN number [PAN Number], Driving License [Driving License Number], and Passport [Passport Number], attest to my commitment to responsibility. With a phone number of [Phone Number], I am easily accessible for communication.",
"assistant": {
"AADHAR_ID": [],
"NAME_STUDENT": ["[Name]"],
"EMAIL": [],
"USERNAME": ["[Username]"],
"PHONE_NUM": ["[Phone Number]"],
"URL_PERSONAL": ["[Personal URL]"],
"STREET_ADDRESS": ["[Address]"],
"PAN_NUMBER": ["[PAN Number]"],
"DRIVING_LICENSE": ["[Driving License Number]"],
"PASSPORT": ["[Passport Number]"]
}
}
Data Fields
system (str): The instructions provided to the assistant for extracting PII.
user (str): The input text containing PII.
assistant (dict): The assistant's response containing extracted PII entities in JSON format.
Note: The exact sizes of each split depend on the dataset version and should be specified accordingly.
Dataset Creation
Curation Rationale
The dataset was curated to assist in developing models that can accurately detect and extract various types of PII from textual data. This is essential for applications requiring data privacy compliance, data anonymization, and secure information handling.
# 命名实体识别 (Named Entity Recognition) 个人可识别信息 (Personally Identifiable Information) 提取数据集 数据集卡片
## 数据集概述
本数据集专为训练和评估聚焦于从文本中提取个人可识别信息 (Personally Identifiable Information) 的命名实体识别模型而设计。数据集涵盖姓名、地址、电话号码、电子邮箱地址以及身份证件号码等多类实体,适用于涉及个人可识别信息检测、合规性审查与数据匿名化的各类任务。
## 支持任务与排行榜
命名实体识别:在文本中识别并分类个人可识别信息实体。
数据匿名化:对文本数据中的个人可识别信息进行掩码或移除处理。
合规性监测:通过检测个人可识别信息,确保文档符合隐私法规要求。
## 数据集结构
### 数据实例
每个数据实例包含三部分:
- 系统指令:用于指导助手提取个人可识别信息的指南。
- 用户输入:包含个人可识别信息的文本内容。
- 助手响应:以JSON格式呈现的已提取的个人可识别信息实体。
### 示例
{
"system": "Extract the following PII entities from the provided text and provide:
- Extract each entity exactly as mentioned in the text.
- Do not create new labels; return each entity under its corresponding label.
- Provide a list of entities for each label. If no entities are found for a label, return an empty list.
- Accuracy and relevance in your responses are crucial.
Labels and their Descriptions:
- AADHAR_ID: Extract Aadhar ID numbers.
- NAME_STUDENT: Extract student names.
- EMAIL: Extract email addresses.
- USERNAME: Extract usernames.
- PHONE_NUM: Extract phone numbers.
- URL_PERSONAL: Extract personal URLs.
- STREET_ADDRESS: Extract street addresses.
- PAN_NUMBER: Extract PAN numbers.
- DRIVING_LICENSE: Extract driving license numbers.
- PASSPORT: Extract passport numbers.
Provide your response in JSON format.",
"user": "As an aspiring writer, I, [Name], residing at [Address], am eager to express my thoughts and ideas through the written word. My digital footprint, accessible through my personal URL [Personal URL] and username [Username], provides a glimpse into my online presence. My personal identification details, including my PAN number [PAN Number], Driving License [Driving License Number], and Passport [Passport Number], attest to my commitment to responsibility. With a phone number of [Phone Number], I am easily accessible for communication.",
"assistant": {
"AADHAR_ID": [],
"NAME_STUDENT": ["[Name]"],
"EMAIL": [],
"USERNAME": ["[Username]"],
"PHONE_NUM": ["[Phone Number]"],
"URL_PERSONAL": ["[Personal URL]"],
"STREET_ADDRESS": ["[Address]"],
"PAN_NUMBER": ["[PAN Number]"],
"DRIVING_LICENSE": ["[Driving License Number]"],
"PASSPORT": ["[Passport Number]"]
}
}
### 数据字段
- `system`(字符串类型):向助手提供的用于提取个人可识别信息的指令。
- `user`(字符串类型):包含个人可识别信息的输入文本。
- `assistant`(字典类型):助手的响应结果,以JSON格式存储已提取的个人可识别信息实体。
注:各数据集拆分的具体规模取决于数据集版本,需根据实际情况指定。
## 数据集构建
### 构建依据
本数据集的遴选构建旨在助力开发能够从文本数据中精准检测并提取各类个人可识别信息的模型。这对于需要满足数据隐私合规、数据匿名化以及安全信息处理的应用场景至关重要。
提供机构:
maas
创建时间:
2025-08-31
搜集汇总
数据集介绍

背景与挑战
背景概述
PII-NER数据集是一个专门用于训练和评估命名实体识别模型的数据集,专注于从文本中提取个人可识别信息(PII),如姓名、地址、电话号码和电子邮件等。它支持PII检测、合规检查和数据匿名化任务,数据实例以结构化格式(包括系统指令、用户输入和助手响应)提供,便于模型学习和评估。
以上内容由遇见数据集搜集并总结生成



