
Ari-S-123/pii-detection-english-consolidated

Source: Hugging Face. Updated: 2025-12-07. Indexed: 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Ari-S-123/pii-detection-english-consolidated
Description:
---
dataset_info:
  features:
  - name: source_text
    dtype: string
  - name: privacy_mask
    list:
    - name: end
      dtype: int64
    - name: label
      dtype: string
    - name: label_index
      dtype: int64
    - name: start
      dtype: int64
    - name: value
      dtype: string
  - name: feature_dimension
    dtype: string
  - name: seed_pii_type
    dtype: string
  - name: seed_pii_value
    dtype: string
  - name: seed_pii_locale
    dtype: string
  - name: scenario
    dtype: string
  - name: type_variant
    dtype: string
  - name: generation_id
    dtype: string
  - name: data_source
    dtype: string
  - name: language
    dtype: string
  - name: region
    dtype: string
  - name: script
    dtype: string
  splits:
  - name: train
    num_bytes: 40500982
    num_examples: 125327
  - name: test
    num_bytes: 10102881
    num_examples: 31361
  download_size: 19401115
  dataset_size: 50603863
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
license: mit
task_categories:
- token-classification
language:
- en
tags:
- pii
- ner
- privacy
- synthetic-data
size_categories:
- 100K<n<1M
---

# PII Detection Combined Dataset

A combined dataset for PII (Personally Identifiable Information) detection. It merges the English-only subset of ai4privacy with challenging synthetic examples that target NER failure modes. The synthetic examples were generated by one LLM and semantically validated by another. Class labels were also consolidated to prevent label fragmentation.

## Dataset Description

This dataset combines two sources:

1. **[ai4privacy/open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)** (English subset): 120,533 train / 30,160 test examples
2. **Synthetic data** (generated by Grok-4.1-Non-Reasoning, validated by GPT-5.1): 4,801 train / 1,201 test examples

**Total**: 125,334 train / 31,361 test examples

### Synthetic Data Feature Dimensions

The synthetic data targets six NER failure-mode dimensions from [Singh & Narayanan (2025) "Unmasking the Reality of PII Masking Models: Performance Gaps and the Call for Accountability"](https://arxiv.org/abs/2504.12308):

| Dimension | Train Count | Description |
|-----------|-------------|-------------|
| adversarial | 464 | Intentionally deceptive patterns |
| basic | 989 | Standard, well-formatted entities |
| contextual | 801 | Ambiguous entities requiring context |
| evolving | 754 | Modern/emerging PII formats |
| multilingual | 917 | International formats in English |
| noisy | 876 | Real-world text imperfections |

## Dataset Schema

| Field | Type | Description |
|-------|------|-------------|
| `source_text` | string | Original text containing PII entities |
| `privacy_mask` | list | Entity annotations with label, start, end, value |
| `data_source` | string | Either "ai4privacy" or "synthetic" |
| `feature_dimension` | string | NER challenge dimension (synthetic only) |
| `language` | string | Language code (always "en") |

## Usage

```python
from datasets import load_dataset

# Load from the Hugging Face Hub
dataset = load_dataset("Ari-S-123/pii-detection-english-consolidated")

# Or load from local Parquet files
dataset = load_dataset("parquet", data_files={
    "train": "train.parquet",
    "test": "test.parquet"
})

# Access examples
print(dataset["train"][0])
```

## Citation

If you use this dataset, please cite:

```bibtex
@misc{pii_combined_dataset_2025,
  title={PII Detection Combined Dataset},
  year={2025},
  publisher={Hugging Face},
  note={Combines ai4privacy English subset with synthetic challenging examples}
}
```

## License

MIT License

## Dataset Creation

- **Created**: 2025-12-05
- **ai4privacy source**: [ai4privacy/open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)
- **Synthetic generation**: xAI Grok-4.1-Non-Reasoning
- **Synthetic validation**: OpenAI GPT-5.1 Low Reasoning Effort
- **Split strategy**: 80/20 stratified split on synthetic data
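
## Example: Working with `privacy_mask`

The schema describes `privacy_mask` as a list of annotations with `label`, `start`, `end`, and `value`. The sketch below shows one way such annotations could be checked and applied to `source_text`. It assumes `start` and `end` are character offsets with `end` exclusive, which the card does not state, so `check_spans` is included to verify that assumption on real records. The helper names, the example record, and the `NAME` / `EMAIL` labels are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List


def check_spans(source_text: str, privacy_mask: List[Dict]) -> List[Dict]:
    """Return annotations whose offsets do not reproduce their `value`.

    Assumes character offsets with an exclusive `end` (not confirmed by the card).
    """
    return [
        ann for ann in privacy_mask
        if source_text[ann["start"]:ann["end"]] != ann["value"]
    ]


def mask_text(source_text: str, privacy_mask: List[Dict]) -> str:
    """Replace each annotated span with a [LABEL] placeholder."""
    masked = source_text
    # Work right to left so earlier offsets stay valid after each replacement.
    for ann in sorted(privacy_mask, key=lambda a: a["start"], reverse=True):
        masked = masked[:ann["start"]] + f"[{ann['label']}]" + masked[ann["end"]:]
    return masked


# Hypothetical record, invented for illustration (labels are placeholders).
example = {
    "source_text": "Contact Jane Doe at jane@example.com.",
    "privacy_mask": [
        {"label": "NAME", "start": 8, "end": 16, "value": "Jane Doe"},
        {"label": "EMAIL", "start": 20, "end": 36, "value": "jane@example.com"},
    ],
}

mismatches = check_spans(example["source_text"], example["privacy_mask"])
masked_example = mask_text(example["source_text"], example["privacy_mask"])
print(mismatches)      # [] when every span matches its value
print(masked_example)  # Contact [NAME] at [EMAIL].
```

If `check_spans` returns mismatches on actual records, the offset convention probably differs from the one assumed here (for example, an inclusive `end`), and `mask_text` would need adjusting before use.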

Provider:
Ari-S-123