Shareable Artificial Intelligence to Extract Cancer Outcomes from Electronic Health Records for Precision Oncology Research

Name: Shareable Artificial Intelligence to Extract Cancer Outcomes from Electronic Health Records for Precision Oncology Research
Creator: PhysioNet
Published: 2024-10-24 15:10:24
License: No description available

Source: DataCite Commons
Updated: 2024-10-24
Indexed: 2025-04-16

Download link: https://physionet.org/content/dfci-cancer-outcomes-ehr/1.0.0/

Description:

Databases that link molecular data to clinical outcomes can inform precision cancer research into novel prognostic and predictive biomarkers. However, outside of clinical trials, cancer outcomes are typically recorded only in text form within electronic health records (EHRs). Artificial intelligence (AI) models have been trained to extract outcomes from individual EHRs, but patient privacy restrictions have historically precluded dissemination of these models beyond the centers at which they were trained. In this study, the vulnerability of text classification models trained directly on protected health information to membership inference attacks was confirmed. A teacher-student distillation approach was applied to develop shareable models for annotating outcomes from imaging reports and medical oncologist notes. 'Teacher' models trained on EHR data from Dana-Farber Cancer Institute (DFCI) were used to label imaging reports and discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset. 'Student' models were trained to use these MIMIC documents to predict the labels assigned by teacher models and sent to Memorial Sloan Kettering (MSK) for evaluation. The student models exhibited high discrimination across outcomes in both the DFCI and MSK test sets. These student models, "DFCI-imaging-student" and "DFCI-medonc-student," are shared here.
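
The teacher-student idea in the description can be illustrated with a toy sketch. This is not the released models or the authors' training code. Everything in it is an assumption made for illustration: `teacher_predict` is a hypothetical keyword rule standing in for a teacher trained on private data, the four "public" documents are invented strings rather than MIMIC-IV text, and a bag-of-words logistic regression is swapped in for whatever architecture the study used. The point it shows is that the student is fitted only to teacher-assigned soft labels on public documents and never sees the private training data.

```python
import math
from collections import Counter


def teacher_predict(text: str) -> float:
    """Hypothetical stand-in for a teacher model trained on private data.

    Returns a soft label: an assumed probability that the text describes
    cancer progression. A keyword rule is used here purely for illustration.
    """
    words = text.lower().split()
    worse = sum(w in {"progression", "worsening", "enlarging"} for w in words)
    better = sum(w in {"stable", "resolved", "decreased"} for w in words)
    return 1.0 / (1.0 + math.exp(-2.0 * (worse - better)))


def featurize(text: str) -> Counter:
    """Bag-of-words counts over whitespace tokens."""
    return Counter(text.lower().split())


def student_predict(weights: dict, bias: float, text: str) -> float:
    """Logistic regression over word counts (the toy student)."""
    z = bias + sum(weights.get(w, 0.0) * c for w, c in featurize(text).items())
    return 1.0 / (1.0 + math.exp(-z))


def train_student(public_docs, epochs: int = 200, lr: float = 0.5):
    """Fit the student to the teacher's soft labels on public documents."""
    # The teacher labels the public corpus; only these labels reach the student.
    soft_labels = [teacher_predict(doc) for doc in public_docs]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for doc, target in zip(public_docs, soft_labels):
            pred = student_predict(weights, bias, doc)
            grad = pred - target  # cross-entropy gradient w.r.t. the logit
            for word, count in featurize(doc).items():
                weights[word] = weights.get(word, 0.0) - lr * grad * count
            bias -= lr * grad
    return weights, bias


# Invented example documents, not real clinical text.
public_docs = [
    "interval progression of hepatic lesions",
    "stable pulmonary nodules",
    "enlarging mediastinal lymph nodes",
    "pleural effusion has resolved",
]
weights, bias = train_student(public_docs)
```

Calling `student_predict(weights, bias, "progression of disease")` then scores new text with the student alone. In this sketch only `weights` and `bias` would be shared, which mirrors the motivation stated in the description: the shared artifact is trained on public documents rather than directly on protected health information.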

Provider: PhysioNet

Created: 2024-10-15
