Robost-AI/PII_FInal
Source: Hugging Face. Updated 2026-03-25; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Robost-AI/PII_FInal
Resource description:
# Multi-Text PII Classifier Dataset
A document-level PII (Personally Identifiable Information) classification dataset with **26 classes**, built for fine-tuning **microsoft/deberta-v3-small**. The pipeline combines three source datasets, applies hierarchical rule-based label assignment, and produces a unified labeled dataset for training a multi-class text classifier.
---
## Table of Contents
- [Task Overview](#task-overview)
- [Class Taxonomy (26 Classes)](#class-taxonomy-26-classes)
- [Source Datasets](#source-datasets)
- [Label Assignment Pipeline](#label-assignment-pipeline)
- [Output Data](#output-data)
- [Class Distribution](#class-distribution)
- [Project Structure](#project-structure)
- [Model Details](#model-details)
- [Pipeline Steps (Planned)](#pipeline-steps-planned)
- [Usage](#usage)
---
## Task Overview
**Goal:** Classify a given text document into one of 26 PII sensitivity categories based on the type and context of personally identifiable information it contains.
This is a **document-level multi-class classification** task (not token-level NER). Each text sample receives exactly one label from the 26-class taxonomy.
**Model:** `microsoft/deberta-v3-small` (~44M parameters, 512 max token context)
---
## Class Taxonomy (26 Classes)
| ID | Class Name | Description |
|----|-----------|-------------|
| 0 | Employee Financial Information | Compensation data — salaries, bonuses, stock options, payroll, pension plans, direct deposit details |
| 1 | Private Credit Agreements | Loan agreements, mortgage contracts, credit applications, amortization schedules |
| 2 | Financial Projections | Forecasts, budgets, business plans, revenue estimates, cash flow projections |
| 3 | Bulk PII | Large collections of PII records (4+ entity types per document) — CSV dumps, data exports |
| 4 | Investment Portfolio Data | Securities prospectuses, portfolio holdings, financial disclosures, investment allocations |
| 5 | Employee PII | Personal employee identifiers — SSN, DOB, address in employment/HR context |
| 6 | Insurance Claims Data | Insurance claim forms (auto, property, dental, vision, disability, etc.) |
| 7 | Customer Authentication Data | Login credentials, passwords, MFA tokens, VPN access, account recovery data |
| 8 | Security Incident Reports | Data breach reports, cybersecurity assessments, vulnerability disclosures, incident logs |
| 9 | Electronic Health Records | Clinical notes, diagnoses, lab results, medical histories, patient records |
| 10 | Sales Pipeline Data | Sales agreements, revenue reports, e-commerce sales analytics, deal proposals |
| 11 | Proprietary Source Code | Internal/proprietary codebases containing embedded secrets, API keys, or PII |
| 12 | Billing and Payment Information | Invoices, bank statements, billing records, EDI/XBRL financial documents |
| 13 | Source Code | General source code files with embedded PII (non-proprietary) |
| 14 | Settlement and Dispute Resolution | Settlement agreements, mediation records, arbitration proceedings, complaint resolutions |
| 15 | Stored Credit Cards | Credit card numbers, CVVs, expiry dates, cardholder data (PCI-DSS scope) |
| 16 | General PII | Basic personal information — names, addresses, phone numbers, emails without specialized context |
| 17 | Employment Records | Employment contracts, offer letters, job descriptions, HR policies (non-financial, non-identity) |
| 18 | Customer PII | Customer-specific personal data — account details, contact info in customer/client context |
| 19 | Payment Transactions | Wire transfers, SWIFT messages, cryptocurrency transactions, payment confirmations |
| 20 | Protected Health Information | HIPAA-covered health data — insurance claims with medical context, PHI documents |
| 21 | Legal Discourse | Regulatory filings, compliance certificates, governance guidelines, audit reports |
| 22 | Mergers and Acquisitions | M&A agreements, takeover documents, buyout terms, divestiture records |
| 23 | Access Keys | API keys, tokens, SSH keys, access credentials, service account secrets |
| 24 | Clinical Trial Data | Study protocols, adverse event reports, enrollment data, randomized trial records |
| 25 | NO PII | Documents containing no personally identifiable information |
---
## Source Datasets
### 1. `synthetic_pii_finance_english.csv` (Primary)
- **Rows:** 28,910
- **Domain:** Finance, insurance, legal, healthcare
- **Origin:** Synthetically generated documents with embedded PII spans
- **Columns:**
| Column | Description |
|--------|-------------|
| `document_type` | High-level document category (60 unique types, e.g., "Loan Agreement", "Health Insurance Claim Form") |
| `document_description` | Natural language description of the document type |
| `expanded_type` | Fine-grained subtype (1,692 unique values, e.g., "Vendor Management Contract") |
| `expanded_description` | Description of the expanded subtype |
| `language` | Language code (all English) |
| `domain` | Domain category |
| `generated_text` | The full synthetic document text |
| `pii_spans` | JSON array of PII entity annotations: `[{"start": N, "end": M, "label": "entity_type"}, ...]` |
| `conformance_score` | Quality metric — how well the document conforms to its type |
| `quality_score` | Overall text quality score |
| `toxicity_score` | Toxicity measure |
| `bias_score` | Bias measure |
| `groundedness_score` | Factual grounding measure |
**PII entity types in `pii_spans`:** `date`, `email`, `phone_number`, `ssn`, `credit_card_number`, `credit_card_security_code`, `api_key`, `password`, `user_name`, `date_of_birth`, `account_number`, `routing_number`, `ip_address`, `swift_code`, `iban`, and 14 others (29 total).
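
Because `pii_spans` is stored as a JSON array, the distinct entity types in a document can be recovered with the standard `json` module. The following is a minimal sketch, assuming each span carries the `start`, `end`, and `label` keys shown in the table above. The helper name `entity_types` and the sample cell are illustrative and are not part of the dataset tooling.

```python
import json


def entity_types(pii_spans_json: str) -> set[str]:
    """Return the distinct entity labels found in one `pii_spans` cell."""
    # An empty string is treated as "no spans" (assumption for this sketch).
    if not pii_spans_json:
        return set()
    spans = json.loads(pii_spans_json)
    return {span["label"] for span in spans}


# Illustrative cell value, not taken from the dataset.
sample = '[{"start": 0, "end": 11, "label": "ssn"}, {"start": 20, "end": 30, "label": "date"}]'
print(sorted(entity_types(sample)))
```

The resulting set is the kind of signal the entity-based labeling rules described later operate on.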
---
### 2. `pii_masking_200k_english.csv` (Supplementary)
- **Rows:** 43,501
- **Domain:** General (short text snippets)
- **Origin:** PII masking dataset with source/target text pairs
- **Avg text length:** ~43 tokens (short snippets)
- **Columns:**
| Column | Description |
|--------|-------------|
| `source_text` | Original text containing PII |
| `target_text` | Text with PII masked (e.g., `[CREDITCARDNUMBER]`) |
| `privacy_mask` | List of PII span dicts in Python literal syntax: `[{"value": "...", "start": N, "end": M, "label": "TYPE"}, ...]` |
| `span_labels` | Token-level BIO labels |
| `mbert_text_tokens` | Tokenized text |
| `mbert_bio_labels` | mBERT-aligned BIO labels |
| `id` | Row identifier |
| `language` | Language code |
| `set` | Train/test split designation |
**PII entity types:** 56 types including `FIRSTNAME`, `LASTNAME`, `SSN`, `CREDITCARDNUMBER`, `CREDITCARDCVV`, `PASSWORD`, `USERNAME`, `IPV4`, `IPV6`, `IBAN`, `ACCOUNTNUMBER`, `PHONENUMBER`, `EMAIL`, `STREET`, `CITY`, `STATE`, `ZIPCODE`, `DOB`, `JOBAREA`, and more.
**Note:** The `privacy_mask` column is serialized as a Python literal (a list of dicts), not as JSON, so parse it with `ast.literal_eval()` rather than `json.loads()`.
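
The difference between the two parsers can be seen in a short sketch. It assumes the cell uses single-quoted strings, which is the usual reason a Python-literal dump is not valid JSON; the sample value is illustrative and not taken from the dataset.

```python
import ast
import json

# Illustrative cell value in Python-literal form (single-quoted strings).
cell = "[{'value': 'jane@example.com', 'start': 8, 'end': 24, 'label': 'EMAIL'}]"

# json.loads rejects single-quoted strings, so parsing fails here.
try:
    json.loads(cell)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval evaluates Python literals only; it does not execute code.
spans = ast.literal_eval(cell)
labels = [span["label"] for span in spans]
print(parsed_as_json, labels)
```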
---
### 3. `synthetic_multi_pii_ner_english.csv` (Small / Validation)
- **Rows:** 452
- **Domain:** 5 domains — general (115), banking (92), finance (88), legal (82), healthcare (75)
- **Origin:** Multi-entity NER dataset with rich entity annotations
- **Purpose:** Primarily reserved for validation/test splits due to small size
- **Columns:**
| Column | Description |
|--------|-------------|
| `text` | Document text |
| `language` | Language code |
| `domain` | Domain category (general, banking, finance, legal, healthcare) |
| `entities` | Entity annotations with types |
| `gliner_tokenized_text` | GLiNER-tokenized text |
| `gliner_entities` | GLiNER entity format annotations |
---
## Label Assignment Pipeline
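Labels are assigned by `label_assignment.py`, which uses a **4-phase hierarchical rule-based approach**, preceded by a Phase 0 NO PII check, to map entity-level NER annotations and document metadata to one of the 26 document-level classes. Phases run in priority order, and later phases only apply to documents that earlier phases left unmatched.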
### Phase 0: NO PII Check
Documents with empty `pii_spans` are assigned to class 25 (NO PII).
### Phase 1: Entity-Based Overrides (Highest Priority)
Cross-cutting rules that fire regardless of `document_type`:
- `api_key` entity present → **Access Keys** (23)
- `credit_card_number` entity present → **Stored Credit Cards** (15)
- `ssn` entity in employment context → **Employee PII** (5)
### Phase 2: Expanded-Type Keyword Overrides
Keyword matching on the `expanded_type` field across all document types:
- Security/breach keywords → **Security Incident Reports** (8)
- Merger/acquisition keywords → **Mergers and Acquisitions** (22)
- Sales/revenue keywords → **Sales Pipeline Data** (10)
- Compensation/payroll keywords → **Employee Financial Information** (0)
- Settlement/mediation keywords → **Settlement and Dispute Resolution** (14)
- Clinical/medical record keywords → **Electronic Health Records** (9)
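
Keyword overrides of this kind can be sketched as an ordered list of rules. The keyword lists below are hypothetical stand-ins, since the actual lists in `label_assignment.py` are not shown here. The first-match-wins ordering simply follows the bullet order above and is also an assumption.

```python
from typing import Optional

# Hypothetical keyword lists; the real ones in label_assignment.py may differ.
# Rules are checked top to bottom and the first match wins (assumed precedence).
KEYWORD_RULES = [
    (("breach", "security incident"), 8),   # Security Incident Reports
    (("merger", "acquisition"), 22),        # Mergers and Acquisitions
    (("sales", "revenue"), 10),             # Sales Pipeline Data
    (("compensation", "payroll"), 0),       # Employee Financial Information
    (("settlement", "mediation"), 14),      # Settlement and Dispute Resolution
    (("clinical", "medical record"), 9),    # Electronic Health Records
]


def keyword_override(expanded_type: str) -> Optional[int]:
    """Return a class ID if any keyword occurs in `expanded_type`, else None."""
    text = expanded_type.lower()
    for keywords, class_id in KEYWORD_RULES:
        if any(keyword in text for keyword in keywords):
            return class_id
    return None


print(keyword_override("Payroll Adjustment Notice"))   # matches "payroll"
print(keyword_override("Vendor Management Contract"))  # no keyword matches
```

Plain substring matching is used for brevity. A production rule set would likely need word-boundary checks to avoid accidental matches inside longer words.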
### Phase 3: Document-Type Rules
Detailed mapping of all 60 `document_type` values to class labels, with further refinement by `expanded_type` sub-keywords. Examples:
- "Insurance Claim Form" → Insurance Claims Data (6) or PHI (20), depending on subtype
- "CSV" → Bulk PII (3)
- "Loan Agreement" → Private Credit Agreements (1)
- "Employment Contract" → Employment Records (17) or Employee PII (5)
### Phase 4: Fallback
Unmatched documents are classified by remaining PII entity signals:
- `password` entity → Customer Authentication Data (7)
- 4+ entity types → Bulk PII (3)
- Otherwise → General PII (16)
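
The entity-driven parts of the cascade (Phases 0, 1, and 4) can be sketched as a single function. This is an illustration under stated assumptions, not the code in `label_assignment.py`: the `employment_context` flag is a hypothetical input (how the script detects employment context is not described), the Phase 1 checks follow the bullet order above, and Phases 2 and 3 are omitted.

```python
def entity_based_label(entity_types: set[str], employment_context: bool = False):
    """Return (class_id, phase) using entity signals only.

    Phases 2 and 3 (keyword and document-type rules) are left out of this sketch.
    """
    # Phase 0: no PII spans at all -> NO PII.
    if not entity_types:
        return 25, "phase0"
    # Phase 1: entity overrides, checked in the listed order (assumed precedence).
    if "api_key" in entity_types:
        return 23, "phase1"
    if "credit_card_number" in entity_types:
        return 15, "phase1"
    if "ssn" in entity_types and employment_context:
        return 5, "phase1"
    # Phase 4: fallback on remaining entity signals.
    if "password" in entity_types:
        return 7, "phase4"
    if len(entity_types) >= 4:
        return 3, "phase4"
    return 16, "phase4"


print(entity_based_label({"api_key", "password"}))
print(entity_based_label({"email"}))
```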
### Confidence Levels
Each labeled example includes a categorical confidence level:
- **HIGH:** Strong signal from document type + entity match
- **MEDIUM:** Keyword-based or contextual match
- **LOW:** Fallback assignment or weak signal
---
## Output Data
All output files are in `labeled_data/`.
### `labeled_all.csv` (Combined)
- **Rows:** 36,367
- **Columns:**
| Column | Type | Description |
|--------|------|-------------|
| `text` | string | The document text |
| `label` | string | Assigned class name (e.g., "Protected Health Information") |
| `label_id` | int | Numeric class ID (0–25) |
| `confidence` | string | Assignment confidence: HIGH, MEDIUM, or LOW |
| `source_dataset` | string | Origin dataset: `synthetic_pii_finance`, `pii_masking_200k`, or `synthetic_multi_pii_ner` |
### Per-Dataset Files
- `labeled_finance.csv` — Labels from the finance dataset only
- `labeled_masking.csv` — Labels from the PII masking dataset only
- `labeled_ner.csv` — Labels from the NER dataset only
### `class_distribution.csv`
Summary report with columns: `label_id`, `label`, `count`.
---
## Class Distribution
Current label counts after running the label assignment pipeline:
| ID | Class | Count | Status |
|----|-------|------:|--------|
| 0 | Employee Financial Information | 413 | LOW — needs augmentation |
| 1 | Private Credit Agreements | 2,421 | OK |
| 2 | Financial Projections | 809 | OK |
| 3 | Bulk PII | 787 | LOW — needs augmentation |
| 4 | Investment Portfolio Data | 2,430 | OK |
| 5 | Employee PII | 65 | CRITICAL — needs synthetic data |
| 6 | Insurance Claims Data | 2,154 | OK |
| 7 | Customer Authentication Data | 189 | CRITICAL — needs synthetic data |
| 8 | Security Incident Reports | 786 | LOW — needs augmentation |
| 9 | Electronic Health Records | 151 | CRITICAL — needs synthetic data |
| 10 | Sales Pipeline Data | 51 | CRITICAL — needs synthetic data |
| 11 | Proprietary Source Code | 0 | CRITICAL — needs full synthetic generation |
| 12 | Billing and Payment Information | 4,870 | OK (will be capped) |
| 13 | Source Code | 0 | CRITICAL — needs full synthetic generation |
| 14 | Settlement and Dispute Resolution | 385 | LOW — needs augmentation |
| 15 | Stored Credit Cards | 687 | LOW — needs augmentation |
| 16 | General PII | 1,986 | OK |
| 17 | Employment Records | 227 | LOW — needs augmentation |
| 18 | Customer PII | 1,002 | OK |
| 19 | Payment Transactions | 4,553 | OK (will be capped) |
| 20 | Protected Health Information | 7,379 | OK (will be capped at 2,000) |
| 21 | Legal Discourse | 2,577 | OK |
| 22 | Mergers and Acquisitions | 49 | CRITICAL — needs synthetic data |
| 23 | Access Keys | 495 | LOW — needs augmentation |
| 24 | Clinical Trial Data | 1 | CRITICAL — needs full synthetic generation |
| 25 | NO PII | 1,900 | OK |
**Classes needing full synthetic generation (0 or near-0 examples):** Proprietary Source Code, Source Code, Clinical Trial Data
**Classes needing significant augmentation (<200 examples):** Employee PII, Customer Authentication Data, Electronic Health Records, Sales Pipeline Data, Mergers and Acquisitions
---
## Project Structure
```
datasets/
├── README.md # This file
├── label_assignment.py # Label assignment pipeline (4-phase hierarchical rules)
│
├── synthetic_pii_finance_english.csv # Source: 28,910 synthetic finance/legal/health documents
├── pii_masking_200k_english.csv # Source: 43,501 short PII masking snippets
├── synthetic_multi_pii_ner_english.csv # Source: 452 multi-entity NER samples (5 domains)
│
├── download_english_finance_pii.py # Download script for finance dataset
├── download_english_pii_masking_200k.py # Download script for masking dataset
├── download_english_pii_ner.py # Download script for NER dataset
│
└── labeled_data/
    ├── labeled_all.csv # Combined labeled data (36,367 rows)
    ├── labeled_finance.csv # Finance dataset labels
    ├── labeled_masking.csv # Masking dataset labels
    ├── labeled_ner.csv # NER dataset labels
    └── class_distribution.csv # Class distribution summary
```
---
## Model Details
| Property | Value |
|----------|-------|
| Model | `microsoft/deberta-v3-small` |
| Parameters | ~44M |
| Max Tokens | 512 (with position extension) |
| Task | Multi-class classification (26 classes) |
| Truncation Strategy | Head+tail (384 head + 128 tail tokens) |
| Loss Function | Class-weighted cross-entropy (to handle imbalance) |
| Target per Class | ~1,200 examples (800 minimum, 2,000 cap) |
| Target Total | ~31,200 examples |
| Split | 70% train / 15% validation / 15% test (stratified) |
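
The head+tail truncation strategy in the table can be illustrated on a plain list of token IDs. This is a minimal sketch: the function name is hypothetical, and it is an assumption that the 384/128 budget counts content tokens only. Special tokens added by the tokenizer would also count toward the 512-token limit, so a real implementation may need to reserve room for them.

```python
def head_tail_truncate(token_ids: list[int], head: int = 384, tail: int = 128) -> list[int]:
    """Keep the first `head` and last `tail` tokens of an over-long sequence."""
    if len(token_ids) <= head + tail:
        return token_ids
    # Slice from an explicit index so that tail=0 yields an empty tail.
    return token_ids[:head] + token_ids[len(token_ids) - tail:]


tokens = list(range(1000))  # stand-in for real token IDs
truncated = head_tail_truncate(tokens)
print(len(truncated), truncated[383], truncated[384])
```

Keeping both ends preserves document openings (titles, headers) and closings (signatures, totals), which often carry the strongest document-type cues.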
---
## Pipeline Steps (Planned)
| Step | Script | Status | Description |
|------|--------|--------|-------------|
| 1 | `label_assignment.py` | Done | Assign 26-class labels to all three source datasets |
| 2 | `generate_synthetic.py` | Pending | Generate synthetic data for ~10 underrepresented classes |
| 3 | `preprocess.py` | Pending | Text preprocessing, deduplication, class balancing, train/val/test split |
| 4 | `train.py` | Pending | Fine-tune DeBERTa-v3-small with class-weighted loss |
| 5 | `evaluate.py` | Pending | Per-class metrics, confusion matrix, error analysis |
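
Step 4 relies on a class-weighted loss. One common choice is inverse-frequency weighting, where each class weight is `total / (num_classes * count)`. The sketch below assumes that scheme; the planned `train.py` may compute weights differently. Classes with zero examples are skipped, since they have no defined weight until synthetic data is added.

```python
def inverse_frequency_weights(counts: dict[int, int]) -> dict[int, float]:
    """Weight each class by total / (num_classes * count), skipping empty classes."""
    present = {cid: n for cid, n in counts.items() if n > 0}
    total = sum(present.values())
    return {cid: total / (len(present) * n) for cid, n in present.items()}


# Three counts taken from the Class Distribution table, for illustration only.
counts = {20: 7379, 5: 65, 25: 1900}
weights = inverse_frequency_weights(counts)
print({cid: round(w, 3) for cid, w in weights.items()})
```

Rare classes receive larger weights, so their misclassifications contribute more to the loss. The resulting values could be passed as the per-class weight tensor of a cross-entropy loss.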
---
## Usage
### Run Label Assignment
```bash
cd datasets/
python label_assignment.py
```
This reads all three source CSVs, applies the hierarchical labeling rules, and writes output to `labeled_data/`.
### Load Labeled Data
```python
import pandas as pd
df = pd.read_csv("labeled_data/labeled_all.csv")
print(df["label"].value_counts())
# Filter by confidence
high_conf = df[df["confidence"] == "HIGH"]
# Filter by source
finance_only = df[df["source_dataset"] == "synthetic_pii_finance"]
```
Provider:
Robost-AI



