smoldocling-hospital-privacy/synthetic-hospital-doctags-250
Source: Hugging Face | Updated 2025-11-23 | Indexed 2025-11-30
Download link:
https://hf-mirror.com/datasets/smoldocling-hospital-privacy/synthetic-hospital-doctags-250
Description:
The SmolDocling Hospital Privacy Synthetic Dataset contains synthetic single-page "discharge summary" documents for evaluating privacy leakage in document vision-language models in a hospital setting. The dataset is divided into training, validation, and test splits with different numbers of pages; each page is a scan-like discharge summary with tables. The task is to describe each page's structure and content in DocTags markup. The dataset also includes canary tokens and decoy strings to support research on membership inference, canary exposure, and differentially private LoRA fine-tuning.
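As a rough illustration of the canary-exposure measurement mentioned in the description, the sketch below computes the common rank-based exposure score: log2 of the candidate-space size minus log2 of the canary's rank by model loss. It assumes per-string losses have already been obtained from a fine-tuned model. The function name `canary_exposure` and the use of the decoy strings as the reference candidates are assumptions made for this example, not something the dataset card specifies.

```python
import math


def canary_exposure(canary_loss, candidate_losses):
    """Rank-based exposure of one canary (hypothetical helper).

    canary_loss: model loss on the inserted canary string.
    candidate_losses: losses on reference strings that were NOT inserted
        (assumed here to be the decoy strings).
    """
    # Rank 1 means the canary has the lowest loss among all strings.
    rank = 1 + sum(1 for loss in candidate_losses if loss < canary_loss)
    # The candidate space is the reference strings plus the canary itself.
    total = len(candidate_losses) + 1
    return math.log2(total) - math.log2(rank)


# Toy numbers, invented for illustration only.
memorized = canary_exposure(0.5, [1.0, 2.0, 3.0])    # canary ranked first
unmemorized = canary_exposure(5.0, [1.0, 2.0, 3.0])  # canary ranked last
print(memorized, unmemorized)
```

A higher score means the model ranks the inserted canary ahead of more of the reference strings, which is read as stronger evidence of memorization. With only a handful of reference strings the score is coarse, so a real study would use a much larger candidate set.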
Provider:
smoldocling-hospital-privacy



