guardion/BR-Agentic-PII-Benchmark
收藏Hugging Face2026-04-06 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/guardion/BR-Agentic-PII-Benchmark
下载链接
链接失效反馈官方服务:
资源简介:
---
language:
- pt
license: mit
task_categories:
- text-generation
- token-classification
tags:
- pii
- synthetic
- llm-agents
- anpd
- brazil
- banking
- finance
size_categories:
- n<1K
---
# BR-Agentic-PII-Benchmark
## Overview
**BR-Agentic-PII-Benchmark** is a highly annotated dataset of synthetic multi-turn conversations between humans and AI banking assistants in Brazilian Portuguese. It is designed to benchmark the processes of **detection, anonymization, de-anonymization, and transparency** for AI agents.
### Key Use Cases
This dataset helps evaluate systems that redact and protect Personally Identifiable Information (PII) before it leaves a secure perimeter, specifically aimed at:
- **Client-facing Chatbots:** Scanning tool responses before generating user responses.
- **Coding & Internal Agents:** Ensuring confidential identifiers aren't leaked.
- **External API Routing:** Redacting data prior to dispatching requests to external, third-party LLMs.
---
## Dataset Description
The dataset simulates realistic multi-agent tool-use interactions. Instead of classic static text formats, the records track entire conversations where an LLM functions as an agent calling JSON-structured tools to retrieve banking information.
### Structure & Fields
Each record in the `.jsonl` file represents a single, complete multi-turn conversation containing the following top-level schema:
- `conversation_id`: A unique UUID for the generated conversation.
- `persona_id`: The underlying demographic profile ID assigned to the synthetic user.
- `scenario`: A string identifier defining the specific interaction context (e.g., `consulta_saldo`, `transferencia_pix`).
- `pii_ratio`: A float representing the ratio of successfully injected PII against the scenario's required placeholders.
- `pii_summary`: A nested dictionary tallying the total counts of PII injected across different categories and message roles.
- `persona`: A nested object storing the raw generated identity variables used (Name, CPF, Income, Address, etc.) prior to injection.
- `messages`: An array of interactions depicting the dialogue, consisting of:
- `role`: (`system`, `user`, `assistant`, `tool`).
- `content`: The raw text string or JSON representation.
- `tool_calls` (optional): The functional tool execution payload if the assistant dispatched an API call.
- `pii_spans`: The precise array tracking PII boundaries inside this specific message. Contains:
- `text`: The exact PII string.
- `label`: Entity class (e.g., `PERSON_NAME`).
- `start` & `end`: Exact character offsets.
- `field_type`: Whether the span was identified inside unstructured chat (`text_content`), agent execution parameters (`tool_argument`), or simulated API callbacks (`tool_result`).
### PII Taxonomy
The dataset strictly maps 10 comprehensive, localized Brazilian PIIs:
- `PERSON_NAME`: Full or partial names mapped alongside demographic logic.
- `CPF`: Valid modulo-11 formatted individual taxpayer registry IDs.
- `PHONE_NUMBER`: Landlines and cellphones with localized state prefixes (DDDs).
- `EMAIL`: Realistic domains and formatting.
- `STREET_ADDRESS`: Valid IBGE municipalities, street names, and postal codes.
- `PIX_KEY`: Authentic instant payment keys (Phone, Email, CPF, Random).
- `BANK_ACCOUNT` & `BANK_BRANCH`: Valid FEBRABAN complaint branch/account numbering formats.
- `FINANCIAL_VALUE`: Account balances, simulated salaries, and transaction limits.
- `CREDIT_CARD`: Realistic credit card spans.
---
## Extensibility and Future Horizons
The architecture of this dataset pipeline is highly extensible. While this initial version focuses on retail banking consumer PII, the methodology can be expanded to benchmark other high-risk agentic contexts:
- **Coding & Internal Orchestration Agents:** Detecting critical data leaks beyond standard PII, including API keys, JWT tokens, AWS credentials, and proprietary database secrets before they hit third-party APIs.
- **Corporate & B2B Entities:** Generating enterprise contexts containing CNPJs, corporate addresses, trade secrets, and legal entity structures.
- **Adversarial & Jailbreak Scenarios:** Simulating prompt injections or complex social engineering logic designed to trick the agent into inappropriately fetching and disclosing sensitive information via its internal tools.
- **Special Category Sensitive Data:** Introducing highly sensitive contexts protected strictly by LGPD/GDPR regulations (e.g., medical histories, union affiliations, religious beliefs) designed for interactions with HR or Healthcare agents.
- **Cross-Lingual Workflows:** Extending the persona formats to evaluate zero-shot PII detection resilience across multilingual deployments.
---
## Methodology
To guarantee **100% accurate PII spans** without LLM hallucination:
1. **Persona Assembly:** Uses real demographics logic (IBGE) and localization (Faker pt_BR).
2. **Generative Synthesis:** A Gemini model enacts banking scenarios but strictly outputs placeholders (e.g., `Meu CPF é {CPF}`).
3. **Deterministic Injection:** An AST pipeline substitutes placeholders with generated data into human text, tool callbacks, and JSON arguments, perfectly recording character span start/end integers.
---
## License
**MIT License**
Contains fully synthetic data. No real personal data is included.
提供机构:
guardion



