ASQ-PHI: An Adversarial Synthetic Benchmark for Clinical De-Identification and Search Utility
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://data.mendeley.com/datasets/csz5dzp7nx
Description:
Hospitals are beginning to deploy HIPAA-compliant Business Associate Agreement (BAA) large language models (LLMs). In the public setting, LLMs with fixed training cutoffs are routinely augmented with tools such as web search, deep research, and Model Context Protocol (MCP) servers so they can reach up-to-date knowledge. BAA LLMs, by contrast, almost never expose live web search or external tools, even though clinicians expect LLMs to surface current guidelines, drug safety updates, and literature. The constraint is that any query leaving a BAA-protected LLM for an external service must be free of Protected Health Information (PHI). We refer to this boundary as the safe handoff: the moment when a clinician’s PHI-containing query, generated inside a HIPAA-compliant BAA LLM, must be transformed into a HIPAA Safe Harbor–compliant version before being sent to non-BAA tools such as web search APIs, external evidence services, or MCP servers. Existing de-identification datasets are built from long electronic health record narratives rather than the short, compressed search queries clinicians type into LLM chat interfaces, so they do not allow de-identification performance to be tested at this safe handoff. ASQ-PHI (Adversarial Synthetic Queries for Protected Health Information de-identification) is constructed to supply this missing data: a benchmark of 1,051 fully synthetic clinical search queries with ground-truth PHI annotations for stress-testing HIPAA-compliant de-identification software. All queries were generated using Azure OpenAI GPT-4o. No real patient data were used.
Research hypothesis:
Current de-identification systems fail at the safe handoff from LLMs running inside HIPAA BAAs to external tools in two ways:
1) Leaking PHI
2) Over-redacting non-identifying clinical information, which reduces query utility.
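Both failure modes can be scored per query once a de-identifier has produced its output. The sketch below shows one simple way to do that, under stated assumptions: a PHI value counts as leaked if its ground-truth string still appears (case-insensitively) in the redacted query, and a hard negative counts as over-redacted if the tool changed it at all. The function names and example strings are hypothetical and are not taken from the dataset's code.

```python
def leaked_values(redacted_query: str, phi_values: list[str]) -> list[str]:
    """Return the ground-truth PHI strings that survive in the redacted query."""
    lowered = redacted_query.lower()
    return [value for value in phi_values if value.lower() in lowered]


def is_over_redacted(original_query: str, redacted_query: str) -> bool:
    """For a hard negative (no PHI), any change counts as unnecessary redaction."""
    return original_query.strip() != redacted_query.strip()


# Hypothetical PHI-positive example: the name and date were removed, the city was not.
redacted = "Is apixaban safe for [NAME], seen in Tucson on [DATE]?"
print(leaked_values(redacted, ["Jane Doe", "Tucson", "03/14/2024"]))

# Hypothetical hard negative: the drug name is not PHI but was redacted anyway.
original = "What is the maximum daily dose of metformin in stage 3 CKD?"
over_redacted = "What is the maximum daily dose of [NAME] in stage 3 CKD?"
print(is_over_redacted(original, over_redacted))
```

Exact substring matching is deliberately crude: it would miss a partial leak such as a surname left behind on its own, so a stricter evaluation would likely need token-level or span-level comparison.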
What the data shows:
The dataset contains 832 PHI-positive queries (79.2 percent) and 219 hard negatives (20.8 percent), totaling 2,973 tagged PHI elements across 13 HIPAA Safe Harbor types. Most common: GEOGRAPHIC_LOCATION (27.8 percent), NAME (27.4 percent), DATE (27.1 percent). Hard negatives mimic PHI-containing text but contain no identifiers, enabling over-redaction measurement.
Notable findings:
Baseline validation using Amazon Comprehend Medical (DetectPHI) revealed severe recall-utility tradeoffs, as shown in validation_results.
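As a rough illustration of how such a baseline could be wired up, the sketch below masks the character spans returned by the DetectPHI operation. It relies on the documented response shape (an Entities list whose items carry BeginOffset, EndOffset, and Type) and treats EndOffset as exclusive. The helper names, the [TYPE] placeholder format, and the hand-written response are assumptions for illustration and do not reproduce the authors' validation code. The boto3 call needs AWS credentials, so it is kept in a separate function that the example does not invoke.

```python
def detect_phi(text: str) -> dict:
    """Call Amazon Comprehend Medical DetectPHI (requires AWS credentials)."""
    import boto3  # imported lazily so the masking helper below works offline

    client = boto3.client("comprehendmedical")
    return client.detect_phi(Text=text)


def mask_entities(text: str, response: dict) -> str:
    """Replace each detected span with a [TYPE] placeholder."""
    entities = sorted(
        response.get("Entities", []),
        key=lambda ent: ent["BeginOffset"],
        reverse=True,
    )
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in entities:
        start, end = ent["BeginOffset"], ent["EndOffset"]
        text = text[:start] + f"[{ent['Type']}]" + text[end:]
    return text


# Hand-written response in the DetectPHI shape (not real service output).
query = "Is apixaban safe for Jane Doe?"
fake_response = {"Entities": [{"BeginOffset": 21, "EndOffset": 29, "Type": "NAME"}]}
masked = mask_entities(query, fake_response)
print(masked)
```

This sketch does not handle overlapping entities; a fuller harness would merge overlapping spans before masking.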
How the data was generated:
Queries were generated using GPT-4o. A system prompt specified natural clinical questions containing 0 to 5 HIPAA identifiers. Each output consists of a "===QUERY===" section followed by a "===PHI_TAGS===" section with JSON of the form {"identifier_type": "TYPE", "value": "..."}. The complete code is in code/data_generation_pipeline.ipynb.
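The output format described above can be parsed with a few lines of Python. This is a minimal sketch, not the code from code/data_generation_pipeline.ipynb: it assumes that each record holds the query text after "===QUERY===" and a JSON array of tag objects after "===PHI_TAGS===" (empty for hard negatives). The function name parse_record and the sample record are hypothetical.

```python
import json

QUERY_MARKER = "===QUERY==="
TAGS_MARKER = "===PHI_TAGS==="


def parse_record(raw: str) -> dict:
    """Split one generated record into its query text and list of PHI tags.

    Assumes the tags section is a JSON array of objects shaped like
    {"identifier_type": "TYPE", "value": "..."}; the real files may differ.
    """
    if QUERY_MARKER not in raw or TAGS_MARKER not in raw:
        raise ValueError("record is missing a section marker")
    # Everything between the two markers is the query text.
    after_query = raw.split(QUERY_MARKER, 1)[1]
    query_part, tags_part = after_query.split(TAGS_MARKER, 1)
    tags = json.loads(tags_part.strip() or "[]")
    if isinstance(tags, dict):  # tolerate a single bare object
        tags = [tags]
    return {"query": query_part.strip(), "phi_tags": tags}


# Hypothetical record, invented for illustration (not taken from the dataset).
sample = """===QUERY===
Is apixaban safe for Jane Doe, seen in Tucson on 03/14/2024?
===PHI_TAGS===
[{"identifier_type": "NAME", "value": "Jane Doe"},
 {"identifier_type": "GEOGRAPHIC_LOCATION", "value": "Tucson"},
 {"identifier_type": "DATE", "value": "03/14/2024"}]
"""

record = parse_record(sample)
print(record["query"])
print([tag["identifier_type"] for tag in record["phi_tags"]])
```

Keeping the tags as a list of dictionaries leaves the identifier types and values available for downstream scoring without committing to a particular schema.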
The dataset serves developers building HIPAA-compliant LLM infrastructure, privacy researchers benchmarking algorithms, and compliance teams evaluating vendor claims.
Created:
2025-11-21



