mabo1215/CPPB

Name: mabo1215/CPPB
Creator: mabo1215
Published: 2026-04-08 13:07:22
License: no description provided

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/mabo1215/CPPB


Resource description:

---
language:
- en
pretty_name: Controlled Prompt-Privacy Benchmark
size_categories:
- n<1K
task_categories:
- text-generation
- text-classification
task_ids:
- named-entity-recognition
- document-question-answering
- text2text-generation
tags:
- privacy
- prompt-security
- de-identification
- redaction
- llm-agents
- evaluation
license: other
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.csv
  - split: validation
    path: data/dev.csv
  - split: test
    path: data/test.csv
---

# CPPB

## Summary

CPPB is the public release surface for the Controlled Prompt-Privacy Benchmark introduced in [BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents](https://arxiv.org/abs/2604.05793).

This Hugging Face package intentionally releases the benchmark-authored prompt manifest and template-stratified train/dev/test split, not raw third-party prompts, source images, or end-to-end OCR assets. Each row is a controlled prompt stub with benchmark metadata that supports reproducible accounting, split-auditability, and benchmark discoverability.

## What Is Released

- 256 benchmark-authored prompt-manifest rows derived from 32 templates x 8 variants.
- Deterministic template-disjoint `train` / `dev` / `test` split: 128 / 64 / 64 rows.
- Prompt-family, privacy-category, subset, modality, and provenance fields needed to reconstruct the released benchmark card.
- Companion release notes and licensing/provenance manifests in the repository bundle.

## What Is Not Released

- Raw user prompts or third-party prompt payloads.
- Original OCR source assets, screenshots, or scanned documents.
- Licensed clinical notes or private operational logs.
- Exact multimodal regeneration assets beyond the released manifest surface.

## Dataset Structure

Main columns:

- `prompt_id`: unique prompt instance identifier.
- `template_id`: template identifier shared across the eight fixed variants.
- `variant_id`: one of `V1`-`V8`.
- `prompt_family`: one of Direct requests, Document-oriented, Retrieval-style, Tool-oriented agent.
- `prompt_source`: benchmark-authored source family.
- `downstream_task_type`: Prompt QA, Document QA, Retrieval QA, or Agent execution.
- `primary_privacy_category`: dominant protected-content category.
- `subset`: Essential-privacy or Incidental-privacy.
- `modality`: Text-only or OCR-mediated text-plus-image.
- `template_stub`: compact template-level description.
- `prompt_stub`: compact prompt-instance description.
- `split`: released split membership.

## Intended Use

- Benchmark accounting and public discoverability for the CPPB release surface.
- Template-disjoint train/dev/test selection for future detector or routing research.
- Evaluation protocol alignment with the BodhiPromptShield paper.

## Limitations

This package is a controlled benchmark manifest, not a full raw-prompt corpus. It should be interpreted as a benchmark-authored release card surface that preserves provenance, split semantics, and release boundaries. If you need end-to-end multimodal regeneration assets or licensed external benchmark inputs, use the repository protocols instead of this Hugging Face package.

## Citation

```bibtex
@article{ma2026bodhipromptshield,
  title={BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents},
  author={Ma, Bo and Wu, Jinsong and Yan, Weiqi},
  journal={arXiv preprint arXiv:2604.05793},
  year={2026},
  url={https://arxiv.org/abs/2604.05793}
}
```

## Repository

- GitHub: https://github.com/mabo1215/BodhiPromptShield
- Paper: https://arxiv.org/abs/2604.05793
- Release source files: `src/experiments/cppb_*`
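
## Example: Auditing the Split (illustrative)

The card describes the split as template-disjoint, with 32 templates x 8 variants yielding 128 / 64 / 64 rows. If every template keeps all eight variants in one split, that works out to 16 / 8 / 8 templates. The sketch below shows one way such a property could be checked from the `template_id` and `split` columns. It is not part of the release: the function names, the `T01`-style identifiers, and the literal split labels `train` / `dev` / `test` are assumptions made for illustration, and the rows are synthetic stand-ins rather than CPPB data.

```python
from collections import Counter, defaultdict


def audit_split(rows):
    """Return (rows per split, templates that appear in more than one split)."""
    rows_per_split = Counter(row["split"] for row in rows)
    splits_by_template = defaultdict(set)
    for row in rows:
        splits_by_template[row["template_id"]].add(row["split"])
    leaked = sorted(t for t, s in splits_by_template.items() if len(s) > 1)
    return dict(rows_per_split), leaked


def synthetic_manifest():
    """Build 32 templates x 8 variants of placeholder rows (NOT real CPPB rows)."""
    plan = [("train", 16), ("dev", 8), ("test", 8)]  # assumed template counts
    rows, template_no = [], 0
    for split, n_templates in plan:
        for _ in range(n_templates):
            template_no += 1
            template_id = f"T{template_no:02d}"  # hypothetical ID format
            for v in range(1, 9):  # variants V1..V8, as listed in the card
                rows.append({
                    "prompt_id": f"{template_id}-V{v}",  # hypothetical ID format
                    "template_id": template_id,
                    "variant_id": f"V{v}",
                    "split": split,
                })
    return rows


rows = synthetic_manifest()
counts, leaked = audit_split(rows)
print(counts, leaked)
```

To run the same check on the released files, the synthetic rows could be replaced with rows read by `csv.DictReader` from `data/train.csv`, `data/dev.csv`, and `data/test.csv`. The actual values stored in the `split` column should be confirmed against the files first.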


Provider:

mabo1215
