smoldocling-hospital-privacy/synthetic-hospital-doctags-1000

Name: smoldocling-hospital-privacy/synthetic-hospital-doctags-1000
Creator: smoldocling-hospital-privacy
Published: 2025-11-23 17:13:39
License: No description available

Source: Hugging Face. Updated 2025-11-23; indexed 2025-11-30.

Download link:

https://hf-mirror.com/datasets/smoldocling-hospital-privacy/synthetic-hospital-doctags-1000


Description:

The SmolDocling Hospital Privacy Synthetic Dataset is a collection of synthetic single-page "discharge summary" documents for evaluating privacy leakage in document vision-language models (SmolDocling) in a hospital setting. The dataset is split into 1000 training pages, 200 validation pages, and 200 test pages. Each page is a scan-like discharge summary containing fixed tables (medications, lab results). The targets are DocTags markup describing the page's structure and content. Special CANARY-... tokens are inserted into the patient_id field of a subset of the training pages. For each canary, approximately 1000 decoy strings are used for exposure ranking. All data is synthetic; no real patient data is used.
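
The listing does not say how the exposure ranking is computed. As a minimal sketch, the snippet below assumes the common rank-based definition of exposure: the canary is ranked against its decoys by a per-string score (here a hypothetical model loss, where lower means "more memorized"), and exposure is log2 of the number of candidates minus log2 of the canary's rank. The function name `exposure` and the example loss values are illustrative assumptions, not part of the dataset.

```python
import math
from typing import Sequence


def exposure(canary_loss: float, decoy_losses: Sequence[float]) -> float:
    """Rank-based exposure of one canary against its decoys.

    Assumption: a lower loss means the model finds the string more likely.
    Rank 1 means the canary scores better than every decoy.
    """
    # Count decoys the model prefers over the canary, then add 1 for the rank.
    rank = 1 + sum(1 for loss in decoy_losses if loss < canary_loss)
    # Candidate set = all decoys plus the canary itself.
    total = len(decoy_losses) + 1
    return math.log2(total) - math.log2(rank)


if __name__ == "__main__":
    # Hypothetical losses: the canary beats all 1023 decoys.
    decoys = [0.5] * 1023
    print(exposure(0.1, decoys))  # maximal exposure for 1024 candidates
    print(exposure(1.0, decoys))  # canary ranked last: no measurable exposure
```

With roughly 1000 decoys per canary, the maximum exposure under this definition is about log2(1001), or just under 10 bits. Values near zero suggest the model does not prefer the canary over random decoys.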

Provider:

smoldocling-hospital-privacy