
Robost-AI/PII_FInal

Source: Hugging Face. Updated 2026-03-25, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Robost-AI/PII_FInal
Description:
# Multi-Text PII Classifier Dataset

A document-level PII (Personally Identifiable Information) classification dataset with **26 classes**, built for fine-tuning **microsoft/deberta-v3-small**. The pipeline combines three source datasets, applies hierarchical rule-based label assignment, and produces a unified labeled dataset for training a multi-class text classifier.

---

## Table of Contents

- [Task Overview](#task-overview)
- [Class Taxonomy (26 Classes)](#class-taxonomy-26-classes)
- [Source Datasets](#source-datasets)
- [Label Assignment Pipeline](#label-assignment-pipeline)
- [Output Data](#output-data)
- [Class Distribution](#class-distribution)
- [Project Structure](#project-structure)
- [Model Details](#model-details)
- [Pipeline Steps (Planned)](#pipeline-steps-planned)
- [Usage](#usage)

---

## Task Overview

**Goal:** Classify a given text document into one of 26 PII sensitivity categories based on the type and context of personally identifiable information it contains.

This is a **document-level multi-class classification** task (not token-level NER). Each text sample receives exactly one label from the 26-class taxonomy.

**Model:** `microsoft/deberta-v3-small` (~44M backbone parameters, 512 max token context)

---

## Class Taxonomy (26 Classes)

| ID | Class Name | Description |
|----|-----------|-------------|
| 0 | Employee Financial Information | Compensation data: salaries, bonuses, stock options, payroll, pension plans, direct deposit details |
| 1 | Private Credit Agreements | Loan agreements, mortgage contracts, credit applications, amortization schedules |
| 2 | Financial Projections | Forecasts, budgets, business plans, revenue estimates, cash flow projections |
| 3 | Bulk PII | Large collections of PII records (4+ entity types per document), such as CSV dumps and data exports |
| 4 | Investment Portfolio Data | Securities prospectuses, portfolio holdings, financial disclosures, investment allocations |
| 5 | Employee PII | Personal employee identifiers (SSN, DOB, address) in an employment/HR context |
| 6 | Insurance Claims Data | Insurance claim forms (auto, property, dental, vision, disability, etc.) |
| 7 | Customer Authentication Data | Login credentials, passwords, MFA tokens, VPN access, account recovery data |
| 8 | Security Incident Reports | Data breach reports, cybersecurity assessments, vulnerability disclosures, incident logs |
| 9 | Electronic Health Records | Clinical notes, diagnoses, lab results, medical histories, patient records |
| 10 | Sales Pipeline Data | Sales agreements, revenue reports, e-commerce sales analytics, deal proposals |
| 11 | Proprietary Source Code | Internal/proprietary codebases containing embedded secrets, API keys, or PII |
| 12 | Billing and Payment Information | Invoices, bank statements, billing records, EDI/XBRL financial documents |
| 13 | Source Code | General source code files with embedded PII (non-proprietary) |
| 14 | Settlement and Dispute Resolution | Settlement agreements, mediation records, arbitration proceedings, complaint resolutions |
| 15 | Stored Credit Cards | Credit card numbers, CVVs, expiry dates, cardholder data (PCI-DSS scope) |
| 16 | General PII | Basic personal information (names, addresses, phone numbers, emails) without specialized context |
| 17 | Employment Records | Employment contracts, offer letters, job descriptions, HR policies (non-financial, non-identity) |
| 18 | Customer PII | Customer-specific personal data (account details, contact info) in a customer/client context |
| 19 | Payment Transactions | Wire transfers, SWIFT messages, cryptocurrency transactions, payment confirmations |
| 20 | Protected Health Information | HIPAA-covered health data: insurance claims with medical context, PHI documents |
| 21 | Legal Discourse | Regulatory filings, compliance certificates, governance guidelines, audit reports |
| 22 | Mergers and Acquisitions | M&A agreements, takeover documents, buyout terms, divestiture records |
| 23 | Access Keys | API keys, tokens, SSH keys, access credentials, service account secrets |
| 24 | Clinical Trial Data | Study protocols, adverse event reports, enrollment data, randomized trial records |
| 25 | NO PII | Documents containing no personally identifiable information |

---

## Source Datasets

### 1. `synthetic_pii_finance_english.csv` (Primary)

- **Rows:** 28,910
- **Domain:** Finance, insurance, legal, healthcare
- **Origin:** Synthetically generated documents with embedded PII spans
- **Columns:**

| Column | Description |
|--------|-------------|
| `document_type` | High-level document category (60 unique types, e.g., "Loan Agreement", "Health Insurance Claim Form") |
| `document_description` | Natural language description of the document type |
| `expanded_type` | Fine-grained subtype (1,692 unique values, e.g., "Vendor Management Contract") |
| `expanded_description` | Description of the expanded subtype |
| `language` | Language code (all English) |
| `domain` | Domain category |
| `generated_text` | The full synthetic document text |
| `pii_spans` | JSON array of PII entity annotations: `[{"start": N, "end": M, "label": "entity_type"}, ...]` |
| `conformance_score` | Quality metric: how well the document conforms to its type |
| `quality_score` | Overall text quality score |
| `toxicity_score` | Toxicity measure |
| `bias_score` | Bias measure |
| `groundedness_score` | Factual grounding measure |

**PII entity types in `pii_spans`:** `date`, `email`, `phone_number`, `ssn`, `credit_card_number`, `credit_card_security_code`, `api_key`, `password`, `user_name`, `date_of_birth`, `account_number`, `routing_number`, `ip_address`, `swift_code`, `iban`, and 14 others (29 total).

---

### 2. `pii_masking_200k_english.csv` (Supplementary)

- **Rows:** 43,501
- **Domain:** General (short text snippets)
- **Origin:** PII masking dataset with source/target text pairs
- **Avg text length:** ~43 tokens (short snippets)
- **Columns:**

| Column | Description |
|--------|-------------|
| `source_text` | Original text containing PII |
| `target_text` | Text with PII masked (e.g., `[CREDITCARDNUMBER]`) |
| `privacy_mask` | PII spans stored as a Python literal (a list of dicts): `[{"value": "...", "start": N, "end": M, "label": "TYPE"}, ...]` |
| `span_labels` | Token-level BIO labels |
| `mbert_text_tokens` | Tokenized text |
| `mbert_bio_labels` | mBERT-aligned BIO labels |
| `id` | Row identifier |
| `language` | Language code |
| `set` | Train/test split designation |

**PII entity types:** 56 types including `FIRSTNAME`, `LASTNAME`, `SSN`, `CREDITCARDNUMBER`, `CREDITCARDCVV`, `PASSWORD`, `USERNAME`, `IPV4`, `IPV6`, `IBAN`, `ACCOUNTNUMBER`, `PHONENUMBER`, `EMAIL`, `STREET`, `CITY`, `STATE`, `ZIPCODE`, `DOB`, `JOBAREA`, and more.

**Note:** The `privacy_mask` column is serialized as a Python literal rather than JSON, so it must be parsed with `ast.literal_eval()` (not `json.loads()`).

---

### 3. `synthetic_multi_pii_ner_english.csv` (Small / Validation)

- **Rows:** 452
- **Domain:** 5 domains: general (115), banking (92), finance (88), legal (82), healthcare (75)
- **Origin:** Multi-entity NER dataset with rich entity annotations
- **Purpose:** Primarily reserved for validation/test splits due to small size
- **Columns:**

| Column | Description |
|--------|-------------|
| `text` | Document text |
| `language` | Language code |
| `domain` | Domain category (general, banking, finance, legal, healthcare) |
| `entities` | Entity annotations with types |
| `gliner_tokenized_text` | GLiNER-tokenized text |
| `gliner_entities` | GLiNER entity format annotations |

---

## Label Assignment Pipeline

The label assignment is performed by `label_assignment.py`, which uses a **4-phase hierarchical rule-based approach**, preceded by a NO PII check (Phase 0), to map entity-level NER annotations and document metadata to one of the 26 document-level classes. Phases are evaluated in order, and the first rule that matches determines the label.

### Phase 0: NO PII Check

Documents with empty `pii_spans` are assigned to class 25 (NO PII).

### Phase 1: Entity-Based Overrides (Highest Priority)

Cross-cutting rules that fire regardless of `document_type`:

- `api_key` entity present → **Access Keys** (23)
- `credit_card_number` entity present → **Stored Credit Cards** (15)
- `ssn` entity in employment context → **Employee PII** (5)

### Phase 2: Expanded-Type Keyword Overrides

Keyword matching on the `expanded_type` field across all document types:

- Security/breach keywords → **Security Incident Reports** (8)
- Merger/acquisition keywords → **Mergers and Acquisitions** (22)
- Sales/revenue keywords → **Sales Pipeline Data** (10)
- Compensation/payroll keywords → **Employee Financial Information** (0)
- Settlement/mediation keywords → **Settlement and Dispute Resolution** (14)
- Clinical/medical record keywords → **Electronic Health Records** (9)

### Phase 3: Document-Type Rules

Detailed mapping of all 60 `document_type` values to class labels, with further refinement by `expanded_type` sub-keywords.

Examples:

- "Insurance Claim Form" → Insurance Claims Data (6) or PHI (20), depending on subtype
- "CSV" → Bulk PII (3)
- "Loan Agreement" → Private Credit Agreements (1)
- "Employment Contract" → Employment Records (17) or Employee PII (5)

### Phase 4: Fallback

Unmatched documents are classified by remaining PII entity signals:

- `password` entity → Customer Authentication Data (7)
- 4+ entity types → Bulk PII (3)
- Otherwise → General PII (16)

### Confidence Levels

Each labeled example includes a confidence level:

- **HIGH:** Strong signal from document type + entity match
- **MEDIUM:** Keyword-based or contextual match
- **LOW:** Fallback assignment or weak signal

---

## Output Data

All output files are in `labeled_data/`.

### `labeled_all.csv` (Combined)

- **Rows:** 36,367
- **Columns:**

| Column | Type | Description |
|--------|------|-------------|
| `text` | string | The document text |
| `label` | string | Assigned class name (e.g., "Protected Health Information") |
| `label_id` | int | Numeric class ID (0–25) |
| `confidence` | string | Assignment confidence: HIGH, MEDIUM, or LOW |
| `source_dataset` | string | Origin dataset: `synthetic_pii_finance`, `pii_masking_200k`, or `synthetic_multi_pii_ner` |

### Per-Dataset Files

- `labeled_finance.csv`: Labels from the finance dataset only
- `labeled_masking.csv`: Labels from the PII masking dataset only
- `labeled_ner.csv`: Labels from the NER dataset only

### `class_distribution.csv`

Summary report with columns: `label_id`, `label`, `count`.
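
For orientation, the rule hierarchy that produces the labels in these files (Phases 0 to 4 under Label Assignment Pipeline) can be sketched roughly as below. This is a minimal illustration, not the contents of `label_assignment.py`: the function name `assign_label`, the keyword lists, the substring test used for "employment context", and the ordering of rules within a phase are all assumptions. Only the two unambiguous `document_type` examples quoted above are mapped, and confidence levels are left out.

```python
import json

# Hypothetical keyword lists: the real lists in label_assignment.py are not
# shown in this README, so these only mirror the categories it names.
PHASE2_KEYWORDS = [
    (("security", "breach"), "Security Incident Reports"),
    (("merger", "acquisition"), "Mergers and Acquisitions"),
    (("sales", "revenue"), "Sales Pipeline Data"),
    (("compensation", "payroll"), "Employee Financial Information"),
    (("settlement", "mediation"), "Settlement and Dispute Resolution"),
    (("clinical", "medical record"), "Electronic Health Records"),
]

# The README says Phase 3 covers 60 document types; only the two unambiguous
# examples it quotes are included here.
PHASE3_DOCUMENT_TYPES = {
    "CSV": "Bulk PII",
    "Loan Agreement": "Private Credit Agreements",
}


def assign_label(document_type, expanded_type, pii_spans_json):
    """Return a class name for one finance-dataset row (illustrative only)."""
    spans = json.loads(pii_spans_json) if pii_spans_json else []
    entity_types = {span["label"] for span in spans}
    context = f"{document_type} {expanded_type}".lower()

    # Phase 0: no annotated PII at all.
    if not entity_types:
        return "NO PII"

    # Phase 1: entity-based overrides, checked in the order the README lists them.
    if "api_key" in entity_types:
        return "Access Keys"
    if "credit_card_number" in entity_types:
        return "Stored Credit Cards"
    # Assumption: "employment context" is approximated by a substring test.
    if "ssn" in entity_types and "employ" in context:
        return "Employee PII"

    # Phase 2: keyword overrides on expanded_type.
    expanded = expanded_type.lower()
    for keywords, label in PHASE2_KEYWORDS:
        if any(keyword in expanded for keyword in keywords):
            return label

    # Phase 3: document-type rules (partial mapping, see above).
    if document_type in PHASE3_DOCUMENT_TYPES:
        return PHASE3_DOCUMENT_TYPES[document_type]

    # Phase 4: fallback on the remaining entity signals.
    if "password" in entity_types:
        return "Customer Authentication Data"
    if len(entity_types) >= 4:
        return "Bulk PII"
    return "General PII"


example_spans = json.dumps([{"start": 0, "end": 11, "label": "ssn"}])
print(assign_label("Employment Contract", "Offer Letter", example_spans))
# prints: Employee PII
```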

---

## Class Distribution

Current label counts after running the label assignment pipeline:

| ID | Class | Count | Status |
|----|-------|------:|--------|
| 0 | Employee Financial Information | 413 | LOW: needs augmentation |
| 1 | Private Credit Agreements | 2,421 | OK |
| 2 | Financial Projections | 809 | OK |
| 3 | Bulk PII | 787 | LOW: needs augmentation |
| 4 | Investment Portfolio Data | 2,430 | OK |
| 5 | Employee PII | 65 | CRITICAL: needs synthetic data |
| 6 | Insurance Claims Data | 2,154 | OK |
| 7 | Customer Authentication Data | 189 | CRITICAL: needs synthetic data |
| 8 | Security Incident Reports | 786 | LOW: needs augmentation |
| 9 | Electronic Health Records | 151 | CRITICAL: needs synthetic data |
| 10 | Sales Pipeline Data | 51 | CRITICAL: needs synthetic data |
| 11 | Proprietary Source Code | 0 | CRITICAL: needs full synthetic generation |
| 12 | Billing and Payment Information | 4,870 | OK (will be capped) |
| 13 | Source Code | 0 | CRITICAL: needs full synthetic generation |
| 14 | Settlement and Dispute Resolution | 385 | LOW: needs augmentation |
| 15 | Stored Credit Cards | 687 | LOW: needs augmentation |
| 16 | General PII | 1,986 | OK |
| 17 | Employment Records | 227 | LOW: needs augmentation |
| 18 | Customer PII | 1,002 | OK |
| 19 | Payment Transactions | 4,553 | OK (will be capped) |
| 20 | Protected Health Information | 7,379 | OK (will be capped at 2,000) |
| 21 | Legal Discourse | 2,577 | OK |
| 22 | Mergers and Acquisitions | 49 | CRITICAL: needs synthetic data |
| 23 | Access Keys | 495 | LOW: needs augmentation |
| 24 | Clinical Trial Data | 1 | CRITICAL: needs full synthetic generation |
| 25 | NO PII | 1,900 | OK |

**Classes needing full synthetic generation (0 or near-0 examples):** Proprietary Source Code, Source Code, Clinical Trial Data

**Classes needing significant augmentation (<200 examples):** Employee PII, Customer Authentication Data, Electronic Health Records, Sales Pipeline Data, Mergers and Acquisitions

---

## Project Structure

```
datasets/
├── README.md                              # This file
├── label_assignment.py                    # Label assignment pipeline (4-phase hierarchical rules)
│
├── synthetic_pii_finance_english.csv      # Source: 28,910 synthetic finance/legal/health documents
├── pii_masking_200k_english.csv           # Source: 43,501 short PII masking snippets
├── synthetic_multi_pii_ner_english.csv    # Source: 452 multi-entity NER samples (5 domains)
│
├── download_english_finance_pii.py        # Download script for finance dataset
├── download_english_pii_masking_200k.py   # Download script for masking dataset
├── download_english_pii_ner.py            # Download script for NER dataset
│
└── labeled_data/
    ├── labeled_all.csv                    # Combined labeled data (36,367 rows)
    ├── labeled_finance.csv                # Finance dataset labels
    ├── labeled_masking.csv                # Masking dataset labels
    ├── labeled_ner.csv                    # NER dataset labels
    └── class_distribution.csv             # Class distribution summary
```

---

## Model Details

| Property | Value |
|----------|-------|
| Model | `microsoft/deberta-v3-small` |
| Parameters | ~44M |
| Max Tokens | 512 (with position extension) |
| Task | Multi-class classification (26 classes) |
| Truncation Strategy | Head+tail (384 head + 128 tail tokens) |
| Loss Function | Class-weighted cross-entropy (to handle imbalance) |
| Target per Class | ~1,200 examples (800 minimum, 2,000 cap) |
| Target Total | ~31,200 examples |
| Split | 70% train / 15% validation / 15% test (stratified) |

---

## Pipeline Steps (Planned)

| Step | Script | Status | Description |
|------|--------|--------|-------------|
| 1 | `label_assignment.py` | Done | Assign 26-class labels to all three source datasets |
| 2 | `generate_synthetic.py` | Pending | Generate synthetic data for ~10 underrepresented classes |
| 3 | `preprocess.py` | Pending | Text preprocessing, deduplication, class balancing, train/val/test split |
| 4 | `train.py` | Pending | Fine-tune DeBERTa-v3-small with class-weighted loss |
| 5 | `evaluate.py` | Pending | Per-class metrics, confusion matrix, error analysis |

---

## Usage

### Run Label Assignment

```bash
cd datasets/
python label_assignment.py
```

This reads all three source CSVs, applies the hierarchical labeling rules, and writes output to `labeled_data/`.

### Load Labeled Data

```python
import pandas as pd

df = pd.read_csv("labeled_data/labeled_all.csv")
print(df["label"].value_counts())

# Filter by confidence
high_conf = df[df["confidence"] == "HIGH"]

# Filter by source
finance_only = df[df["source_dataset"] == "synthetic_pii_finance"]
```
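
Once the labeled text is loaded and tokenized, the head+tail truncation listed under Model Details (384 head + 128 tail tokens) could look roughly like the sketch below. It works on a plain list of token IDs so that it does not depend on any particular tokenizer. The function name `head_tail_truncate` is hypothetical, and the README does not say whether special tokens such as `[CLS]` and `[SEP]` count toward the 512-token budget, so the sketch leaves them out.

```python
def head_tail_truncate(token_ids, head=384, tail=128):
    """Keep the first `head` and last `tail` tokens of an over-long sequence.

    Sequences that already fit within head + tail tokens are returned unchanged.
    Assumption: `token_ids` excludes special tokens; if they must fit inside the
    same budget, reduce `head` or `tail` accordingly.
    """
    if head < 0 or tail < 0:
        raise ValueError("head and tail must be non-negative")
    if len(token_ids) <= head + tail:
        return list(token_ids)
    # Guard tail == 0 explicitly: token_ids[-0:] would return the whole list.
    tail_part = list(token_ids[-tail:]) if tail > 0 else []
    return list(token_ids[:head]) + tail_part


# A 1,000-token document keeps tokens 0..383 and 872..999.
tokens = list(range(1000))
truncated = head_tail_truncate(tokens)
print(len(truncated), truncated[383], truncated[384])
# prints: 512 383 872
```

Keeping both ends is a common choice for long documents, on the reasoning that openings and closings (headers, signatures, totals) often carry much of the classification signal; the README states the 384/128 split but not the motivation for it.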
Provider:
Robost-AI