mabo1215/CPPB
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link: https://hf-mirror.com/datasets/mabo1215/CPPB
Resource description:
---
language:
- en
pretty_name: Controlled Prompt-Privacy Benchmark
size_categories:
- n<1K
task_categories:
- text-generation
- text-classification
task_ids:
- named-entity-recognition
- document-question-answering
- text2text-generation
tags:
- privacy
- prompt-security
- de-identification
- redaction
- llm-agents
- evaluation
license: other
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.csv
  - split: validation
    path: data/dev.csv
  - split: test
    path: data/test.csv
---
# CPPB
## Summary
CPPB is the public release surface for the Controlled Prompt-Privacy Benchmark introduced in [BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents](https://arxiv.org/abs/2604.05793).
This Hugging Face package intentionally releases the benchmark-authored prompt manifest and template-stratified train/dev/test split, not raw third-party prompts, source images, or end-to-end OCR assets. Each row is a controlled prompt stub with benchmark metadata that supports reproducible accounting, split-auditability, and benchmark discoverability.
## What Is Released
- 256 benchmark-authored prompt-manifest rows derived from 32 templates x 8 variants.
- Deterministic template-disjoint `train` / `dev` / `test` split: 128 / 64 / 64 rows. The `dev` file (`data/dev.csv`) is exposed as the `validation` split in the dataset config.
- Prompt-family, privacy-category, subset, modality, and provenance fields needed to reconstruct the released benchmark card.
- Companion release notes and licensing/provenance manifests in the repository bundle.
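The counts above are internally consistent: 32 × 8 = 256 rows, and 128 + 64 + 64 = 256. As a minimal sketch of how a deterministic template-disjoint split can produce those sizes, the snippet below assigns whole templates to splits in a 16 / 8 / 8 ratio. The template IDs (`T01` to `T32`), the sorted-order assignment rule, and the `assign_splits` helper are assumptions for illustration only; this card does not describe the procedure the authors actually used.

```python
from collections import Counter


def assign_splits(template_ids, n_train=16, n_dev=8):
    """Map each template ID to a split; every variant of a template shares it."""
    ordered = sorted(set(template_ids))  # deterministic: no randomness involved
    mapping = {}
    for index, template_id in enumerate(ordered):
        if index < n_train:
            mapping[template_id] = "train"
        elif index < n_train + n_dev:
            mapping[template_id] = "dev"
        else:
            mapping[template_id] = "test"
    return mapping


# Hypothetical identifiers: 32 templates, each with variants V1..V8.
templates = [f"T{i:02d}" for i in range(1, 33)]
variants = [f"V{j}" for j in range(1, 9)]

split_of = assign_splits(templates)
rows = [(t, v, split_of[t]) for t in templates for v in variants]
split_counts = Counter(split for _, _, split in rows)

print(len(rows), dict(split_counts))
```

Because the assignment is made per template rather than per row, no template can appear in more than one split, which is the property "template-disjoint" refers to.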
## What Is Not Released
- Raw user prompts or third-party prompt payloads.
- Original OCR source assets, screenshots, or scanned documents.
- Licensed clinical notes or private operational logs.
- Exact multimodal regeneration assets beyond the released manifest surface.
## Dataset Structure
Main columns:
- `prompt_id`: unique prompt instance identifier.
- `template_id`: template identifier shared across the eight fixed variants.
- `variant_id`: one of `V1`-`V8`.
- `prompt_family`: one of Direct requests, Document-oriented, Retrieval-style, Tool-oriented agent.
- `prompt_source`: benchmark-authored source family.
- `downstream_task_type`: Prompt QA, Document QA, Retrieval QA, or Agent execution.
- `primary_privacy_category`: dominant protected-content category.
- `subset`: Essential-privacy or Incidental-privacy.
- `modality`: Text-only or OCR-mediated text-plus-image.
- `template_stub`: compact template-level description.
- `prompt_stub`: compact prompt-instance description.
- `split`: released split membership.
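The column list above can be turned into a light schema check. The sketch below is one possible way to do it, not part of the released tooling: the `validate_row` helper is hypothetical, the allowed values are assumed to appear in the CSV files exactly as spelled in this card, and the example row uses made-up identifiers.

```python
EXPECTED_COLUMNS = [
    "prompt_id", "template_id", "variant_id", "prompt_family",
    "prompt_source", "downstream_task_type", "primary_privacy_category",
    "subset", "modality", "template_stub", "prompt_stub", "split",
]

# Assumption: the literal strings in the data match the card's wording.
ALLOWED = {
    "variant_id": {f"V{i}" for i in range(1, 9)},
    "prompt_family": {
        "Direct requests", "Document-oriented",
        "Retrieval-style", "Tool-oriented agent",
    },
    "downstream_task_type": {
        "Prompt QA", "Document QA", "Retrieval QA", "Agent execution",
    },
    "subset": {"Essential-privacy", "Incidental-privacy"},
    "modality": {"Text-only", "OCR-mediated text-plus-image"},
}


def validate_row(row):
    """Return a list of problems found in one manifest row (empty if none)."""
    problems = [f"missing column: {c}" for c in EXPECTED_COLUMNS if c not in row]
    for column, allowed in ALLOWED.items():
        value = row.get(column)
        if value is not None and value not in allowed:
            problems.append(f"unexpected {column}: {value!r}")
    return problems


# Hypothetical row; every value here is invented for illustration.
example_row = {
    "prompt_id": "P0001", "template_id": "T01", "variant_id": "V1",
    "prompt_family": "Direct requests", "prompt_source": "benchmark-authored",
    "downstream_task_type": "Prompt QA", "primary_privacy_category": "example",
    "subset": "Essential-privacy", "modality": "Text-only",
    "template_stub": "example template", "prompt_stub": "example prompt",
    "split": "train",
}
print(validate_row(example_row))
```

If the released files use different spellings or casing for these categories, the `ALLOWED` table would need to be adjusted accordingly.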
## Intended Use
- Benchmark accounting and public discoverability for the CPPB release surface.
- Template-disjoint train/dev/test selection for future detector or routing research.
- Evaluation protocol alignment with the BodhiPromptShield paper.
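For the template-disjoint use case, it can be worth auditing a local copy of the manifest before training on it. The following is a minimal sketch using only the Python standard library; the `find_template_leaks` helper is hypothetical, and the in-memory sample (including its deliberately leaking template) is invented so that the check has something to report.

```python
import csv
import io
from collections import defaultdict


def find_template_leaks(*csv_files):
    """Return {template_id: sorted splits} for templates seen in several splits."""
    splits_by_template = defaultdict(set)
    for csv_file in csv_files:
        for row in csv.DictReader(csv_file):
            splits_by_template[row["template_id"]].add(row["split"])
    return {t: sorted(s) for t, s in splits_by_template.items() if len(s) > 1}


# Tiny stand-in for the real files, reduced to the columns the check needs.
sample = io.StringIO(
    "prompt_id,template_id,variant_id,split\n"
    "P001,T01,V1,train\n"
    "P002,T01,V2,train\n"
    "P003,T02,V1,dev\n"
    "P004,T02,V2,test\n"  # deliberate leak: T02 appears in two splits
)
leaks = find_template_leaks(sample)
print(leaks)
```

With a local copy of the dataset, the same function could be called on the opened `data/train.csv`, `data/dev.csv`, and `data/test.csv` files; an empty result would indicate that no template crosses split boundaries.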
## Limitations
This package is a controlled benchmark manifest, not a full raw-prompt corpus. It should be interpreted as a benchmark-authored release card surface that preserves provenance, split semantics, and release boundaries. If you need end-to-end multimodal regeneration assets or licensed external benchmark inputs, use the repository protocols instead of this Hugging Face package.
## Citation
```bibtex
@article{ma2026bodhipromptshield,
title={BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents},
author={Ma, Bo and Wu, Jinsong and Yan, Weiqi},
journal={arXiv preprint arXiv:2604.05793},
year={2026},
url={https://arxiv.org/abs/2604.05793}
}
```
## Repository
- GitHub: https://github.com/mabo1215/BodhiPromptShield
- Paper: https://arxiv.org/abs/2604.05793
- Release source files: `src/experiments/cppb_*`
Provider: mabo1215



