Nemotron-PII
Source: ModelScope community. Updated 2026-01-07; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/nv-community/Nemotron-PII
Resource overview:
# Nemotron-PII: Synthesized Data for Privacy-Preserving AI
## Dataset Description
Nemotron‑PII is a synthetic, persona‑grounded dataset for training and evaluating production‑quality detection of Personally Identifiable Information (PII) and Protected Health Information (PHI) in text. It contains 100,000 English records across 50+ industries, with span‑level annotations for 55+ PII/PHI categories. The records were generated with NVIDIA NeMo Data Designer using synthetic personas grounded in U.S. Census data to ensure demographic realism and contextual consistency. The dataset includes both structured (e.g., forms, invoices) and unstructured (e.g., emails, free text) documents, and each record explicitly marks its locale convention (U.S. or international).
This dataset is ready for commercial use.
## Dataset Owner(s)
NVIDIA Corporation
## Dataset Creation Date
2025/10/28
## License/Terms of Use
This dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0) available at https://creativecommons.org/licenses/by/4.0/legalcode.
## Intended Usage
Train and evaluate Named Entity Recognition (NER) models for PII/PHI detection and redaction in healthcare, finance, legal, and enterprise compliance scenarios. Benchmark model robustness across prompt styles and formats using persona‑grounded texts spanning U.S. and international conventions. Support conversational AI safety and cross‑border privacy tooling with realistic yet synthetically labeled data.
Although this dataset is fully synthetic and designed to avoid real personal data, practitioners should validate models in deployment to prevent missed detections or unintended leakage in downstream processes.
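One way to check for missed detections is exact-match span scoring. The following is a minimal sketch, not an official evaluation script: it assumes that gold annotations and model predictions are both lists of dicts with `start`, `end`, and `label` keys (the shape of the `spans` field), and the function name `span_prf` and the label strings are hypothetical.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall, and F1 over (start, end, label) triples."""
    gold_set = {(s["start"], s["end"], s["label"]) for s in gold}
    pred_set = {(s["start"], s["end"], s["label"]) for s in pred}
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted spans for one record.
gold = [
    {"start": 0, "end": 8, "label": "first_name"},
    {"start": 20, "end": 31, "label": "ssn"},
]
pred = [{"start": 0, "end": 8, "label": "first_name"}]

precision, recall, f1 = span_prf(gold, pred)
print(precision, recall, round(f1, 3))
```

For PII redaction, recall is usually the number to watch, since a missed span is a leak; a partial-overlap criterion may be more appropriate than exact match, depending on the deployment.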
## Dataset Characterization
**Data Collection Method**<br>
[Synthetic] Generated via NVIDIA NeMo Data Designer; persona‑grounded for realistic, consistent entities within each document.
**Labeling Method**<br>
[Synthetic] Span‑level annotations for 55+ PII/PHI entities produced during generation.
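Span-level annotations are typically converted to token-level tags before training a token classifier. The sketch below shows one possible conversion to BIO tags. It assumes whitespace tokenization, character offsets with an exclusive `end`, and non-overlapping spans, none of which is confirmed by this card; the function name `spans_to_bio` and the label strings are hypothetical.

```python
import re


def spans_to_bio(text, spans):
    """Whitespace-tokenize `text` and assign BIO tags from character spans."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for span in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok_start < span["end"] and tok_end > span["start"]:
                prefix = "B" if tok_start <= span["start"] else "I"
                tag = prefix + "-" + span["label"]
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Hypothetical record fragment.
text = "Patient Jane Doe, MRN 12345"
spans = [
    {"start": 8, "end": 16, "label": "name"},
    {"start": 22, "end": 27, "label": "mrn"},
]
tokens, tags = spans_to_bio(text, spans)
print(list(zip(tokens, tags)))
```

In practice, the offsets reported by the model's own tokenizer would replace the whitespace split, but the overlap test stays the same.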
## Dataset Format
Parquet format for efficient storage and processing; JSONL (UTF‑8) and CSV exports also available.
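As an illustration of working with the JSONL export, the sketch below reads one record per line using only the standard library. It assumes that each line is a JSON object and that `spans` is stored as a nested JSON list; the sample lines, the helper name `iter_jsonl`, and the label string are hypothetical.

```python
import io
import json


def iter_jsonl(handle):
    """Yield one record (dict) per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-line export; a real file would be opened with
# open(path, encoding="utf-8") instead of io.StringIO.
sample = io.StringIO(
    '{"uid": "a1", "locale": "us", "text": "Call 555-0100", '
    '"spans": [{"start": 5, "end": 13, "label": "phone_number"}]}\n'
    '{"uid": "b2", "locale": "intl", "text": "No PII here", "spans": []}\n'
)
records = list(iter_jsonl(sample))
print(len(records), records[0]["spans"])
```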
## Dataset Quantification
- Size: 100,000 records (50k train / 50k test)
- Domains: 50+ industries (e.g., healthcare, finance, cybersecurity)
- Entity Types: 55+ PII/PHI categories (e.g., names, SSNs, MRNs, addresses, phones, emails, account numbers)
- Locale Coverage: U.S. and international; international includes ~12% U.S.‑style overlap to reflect real‑world data diversity
- Content Types: Structured (forms, invoices) and unstructured (emails, notes, free text)

Each record contains the following fields:

| Field | Format |
| :-- | :-- |
| uid | String/UUID |
| domain | String |
| document_type | String |
| document_description | String |
| document_format | String: structured \| unstructured |
| locale | String: us \| intl |
| text | UTF‑8 string |
| spans | List[{"start": int, "end": int, "label": str}] |
| text_tagged | String (inline tags) |
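The `spans` field can drive a simple redaction pass over `text`. The sketch below is illustrative only: it assumes `start` and `end` are character offsets with an exclusive `end` (Python slice convention) and that spans do not overlap. The `[LABEL]` placeholder style, the function name `redact`, and the label strings are hypothetical and are not meant to reproduce the `text_tagged` format.

```python
def redact(text, spans):
    """Replace each annotated span with a [LABEL] placeholder.

    Assumes character offsets with an exclusive `end` and no overlaps.
    """
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s["start"]):
        pieces.append(text[cursor:span["start"]])  # untouched text before the span
        pieces.append("[" + span["label"].upper() + "]")
        cursor = span["end"]
    pieces.append(text[cursor:])  # tail after the last span
    return "".join(pieces)


# Hypothetical record with two annotated entities.
record = {
    "text": "Jane Doe lives in Ohio.",
    "spans": [
        {"start": 0, "end": 4, "label": "first_name"},
        {"start": 5, "end": 8, "label": "last_name"},
    ],
}
redacted = redact(record["text"], record["spans"])
print(redacted)
```

Before relying on this, it is worth checking on a few real records that `text[start:end]` yields the expected entity string, since the offset convention is an assumption here.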
## Dataset Structure
- Splits:
  - train: US + intl combined (50,000)
  - test: US + intl combined (50,000)
- Columns:
  - uid, domain, document_type, document_description, document_format, locale, text, spans, text_tagged
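Because each split mixes U.S. and international records, it can be useful to slice by `locale` and `document_format` before training or reporting metrics. The sketch below works on in-memory records (plain dicts with the column names listed above); the sample values and the helper name `tally` are hypothetical.

```python
from collections import Counter


def tally(records):
    """Count records per (locale, document_format) pair."""
    return Counter((r["locale"], r["document_format"]) for r in records)


# Hypothetical records carrying only the two columns used here.
records = [
    {"locale": "us", "document_format": "structured"},
    {"locale": "us", "document_format": "unstructured"},
    {"locale": "intl", "document_format": "structured"},
    {"locale": "us", "document_format": "structured"},
]

counts = tally(records)
us_only = [r for r in records if r["locale"] == "us"]
print(counts, len(us_only))
```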
## References
- NVIDIA NeMo Data Designer (synthetic data generation): https://docs.nvidia.com/nemo/microservices/latest/generate-synthetic-data/index.html
- Generate Realistic Persons (personas): https://docs.nvidia.com/nemo/microservices/latest/generate-synthetic-data/generate-realistic-personal-details.html
- Nemotron‑Personas collection: https://huggingface.co/collections/nvidia/nemotron-personas
- Gretel PII Masking dataset (related work): https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1
## Ethical Considerations
NVIDIA believes [Trustworthy AI](https://www.nvidia.com/en-us/ai-data-science/trustworthy-ai/) is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal teams to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
## Citation
If you use this dataset in your research, please cite it as follows:
```bibtex
@dataset{nemotron-pii,
author = {Amy Steier and Andre Manoel and Alexa Haushalter and Maarten Van Segbroeck},
title = {Nemotron-PII: Synthesized Data for Privacy-Preserving AI},
year = {2025},
publisher = {NVIDIA},
url = {https://huggingface.co/datasets/nvidia/Nemotron-PII}
}
```
Provider: maas
Created: 2025-10-29



