smoldocling-hospital-privacy/synthetic-hospital-doctags-1000
Source: Hugging Face. Updated 2025-11-23; indexed 2025-11-30.
Download link:
https://hf-mirror.com/datasets/smoldocling-hospital-privacy/synthetic-hospital-doctags-1000
Description:
The SmolDocling Hospital Privacy Synthetic Dataset is a collection of synthetic single-page discharge summaries for evaluating privacy leakage in document vision-language models (SmolDocling) in a hospital setting. It is split into 1000 training pages, 200 validation pages, and 200 test pages. Each page is a scan-like discharge summary with fixed tables (medications, labs). The targets are DocTags markup describing each page's structure and content. Special CANARY-... tokens are inserted into the patient_id field of a subset of training pages. For each canary, approximately 1000 decoy strings are used for exposure ranking. All data is synthetic; no real patient data is used.
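The description does not say how exposure is computed from the decoys. A minimal sketch follows, assuming the common rank-based definition (exposure = log2 of the candidate count minus log2 of the canary's rank) and assuming that each candidate string has already been scored with a model loss where lower means more likely. The function name `exposure` and the loss inputs are hypothetical and are not part of the dataset.

```python
import math


def exposure(canary_loss, decoy_losses):
    """Rank-based exposure of one canary against its decoys (illustrative).

    Assumption: lower loss means the model finds the string more likely.
    Rank 1 means the canary is more likely than every decoy.
    """
    # Count decoys the model prefers over the canary, then convert to a 1-based rank.
    rank = 1 + sum(1 for loss in decoy_losses if loss < canary_loss)
    # The candidate set is the decoys plus the canary itself.
    total = len(decoy_losses) + 1
    return math.log2(total) - math.log2(rank)
```

With about 1000 decoys per canary, the candidate set holds roughly 1001 strings, so under this definition exposure ranges from 0 (canary ranked last) to about log2(1001), a little under 10 bits (canary ranked first).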
Provider:
smoldocling-hospital-privacy



