
Ari-S-123/better-english-pii-anonymizer

Source: Hugging Face. Updated: 2025-12-05. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Ari-S-123/better-english-pii-anonymizer
Resource description:
---
license: mit
task_categories:
- token-classification
language:
- en
tags:
- pii
- ner
- privacy
- synthetic-data
size_categories:
- 100K<n<1M
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/train-00000-of-00001.parquet"
  - split: test
    path: "data/test-00000-of-00001.parquet"
---

# PII Detection Combined Dataset

A combined dataset for PII (Personally Identifiable Information) detection. It merges the English-only subset of ai4privacy with synthetically generated challenging examples that target NER failure modes.

## Dataset Description

This dataset combines two sources:

1. **[ai4privacy/open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)** (English subset): 120,533 train / 30,160 test examples
2. **Synthetic data** (generated by Grok-4.1-Non-reasoning, validated by GPT-5.1): 4,801 train / 1,201 test examples

**Total**: 125,334 train / 31,361 test examples

### Synthetic Data Feature Dimensions

The synthetic data specifically targets six NER failure mode dimensions from [Singh & Narayanan (2025) "Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability"](https://arxiv.org/abs/2504.12308):

| Dimension | Train Count | Description |
|-----------|-------------|-------------|
| adversarial | 464 | Intentionally deceptive patterns |
| basic | 989 | Standard, well-formatted entities |
| contextual | 801 | Ambiguous entities requiring context |
| evolving | 754 | Modern/emerging PII formats |
| multilingual | 917 | International formats in English |
| noisy | 876 | Real-world text imperfections |

## Dataset Schema

| Field | Type | Description |
|-------|------|-------------|
| `source_text` | string | Original text containing PII entities |
| `privacy_mask` | list | Entity annotations with label, start, end, value |
| `data_source` | string | Either "ai4privacy" or "synthetic" |
| `feature_dimension` | string | NER challenge dimension (synthetic only) |
| `language` | string | Language code (always "en") |

## Usage

```python
from datasets import load_dataset

# Load from the Hugging Face Hub
dataset = load_dataset("Ari-S-123/better-english-pii-anonymizer")

# Or load from local Parquet files
dataset = load_dataset("parquet", data_files={
    "train": "train.parquet",
    "test": "test.parquet"
})

# Access examples
print(dataset["train"][0])
```

## Citation

If you use this dataset, please cite:

```bibtex
@misc{pii_combined_dataset_2025,
  title={PII Detection Combined Dataset},
  year={2025},
  publisher={Hugging Face},
  note={Combines ai4privacy English subset with synthetic challenging examples}
}
```

## License

MIT License

## Dataset Creation

- **Created**: 2025-12-05
- **ai4privacy source**: [ai4privacy/open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
- **Synthetic generation**: xAI Grok-4.1-Non-reasoning
- **Synthetic validation**: OpenAI GPT-5.1
- **Split strategy**: 80/20 stratified split on synthetic data
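
The schema above describes `privacy_mask` as a list of annotations carrying a label, start, end, and value. As a minimal sketch of how such a record might be consumed, the snippet below replaces each annotated span with a `[LABEL]` placeholder and checks that every `value` matches the text at its offsets. It assumes the annotation keys are literally named `label`, `start`, `end`, and `value`, and that `end` is an exclusive character offset; neither is confirmed by the dataset card. The helper names, the sample record, and its `NAME`/`EMAIL` labels are hypothetical and not taken from the dataset.

```python
from typing import Any


def mask_text(source_text: str, privacy_mask: list[dict[str, Any]]) -> str:
    """Replace each annotated span with a [LABEL] placeholder.

    Assumes non-overlapping spans and an exclusive `end` offset.
    """
    masked = source_text
    # Apply replacements from right to left so earlier offsets stay valid.
    for entity in sorted(privacy_mask, key=lambda e: e["start"], reverse=True):
        start, end = entity["start"], entity["end"]
        masked = masked[:start] + f"[{entity['label']}]" + masked[end:]
    return masked


def misaligned_entities(
    source_text: str, privacy_mask: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Return annotations whose `value` differs from the text at [start, end)."""
    return [
        e for e in privacy_mask
        if source_text[e["start"]:e["end"]] != e["value"]
    ]


# Hand-written record for illustration only; labels and text are hypothetical.
example = {
    "source_text": "Contact Jane Doe at jane@example.com.",
    "privacy_mask": [
        {"label": "NAME", "start": 8, "end": 16, "value": "Jane Doe"},
        {"label": "EMAIL", "start": 20, "end": 36, "value": "jane@example.com"},
    ],
    "data_source": "synthetic",
    "feature_dimension": "basic",
    "language": "en",
}

print(mask_text(example["source_text"], example["privacy_mask"]))
# prints: Contact [NAME] at [EMAIL].
print(misaligned_entities(example["source_text"], example["privacy_mask"]))
# prints: []
```

If the offsets in the real data turn out to be inclusive or the key names differ, `misaligned_entities` should report mismatches on the first few records, which makes it a cheap sanity check to run before relying on `mask_text`.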

Provider:
Ari-S-123