Narcolepsy Risk Estimation from Clinical Notes
Source: DataCite Commons. Updated: 2026-03-03. Indexed: 2026-03-29.
Download link:
https://bdsp.io/content/qsyoj1ut1t90zzl9ajpr/1.0/
Description:
Narcolepsy is a chronic neurological disorder that is often underdiagnosed and
subject to long diagnostic delays. We developed and validated machine learning
models to phenotype narcolepsy type 1 (NT1) and narcolepsy type 2/idiopathic
hypersomnia (NT2/IH) using electronic health record (EHR) data from five sites
within the Brain Data Science Platform (BDSP): Mass General Brigham, Beth
Israel Deaconess Medical Center, Boston Children's Hospital, Stanford
University, and Emory University. Clinical notes were manually annotated by
physician reviewers following a standardized protocol, and model features were
derived from ICD codes, medication orders, and natural language keyword
extraction. For cross-sectional classification, we trained logistic
regression, random forest, gradient boosting, and XGBoost models using nested
leave-one-site-out cross-validation. NT1 classification achieved mean AUROCs
of 0.991-0.994 and AUPRCs of 0.906-0.935; NT2/IH classification was more
challenging, with mean AUROCs of 0.967-0.984 and AUPRCs of 0.692-0.778. For
longitudinal prediction, we trained regularized logistic regression models
(SGD with L1 penalty) using cumulative NLP features from pre-diagnostic notes,
with a 6-month horizon exclusion to prevent learning from diagnostic-workup
features. Leave-one-site-out validation achieved AUROCs of 0.80 (any
narcolepsy) and 0.79 (NT1), enabling identification of at-risk patients prior
to clinical diagnosis. We release the associated data and code to support
reproducible research in narcolepsy phenotyping from large-scale EHR data.
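
The leave-one-site-out evaluation and the pre-diagnostic horizon exclusion described above can be illustrated with a minimal standard-library sketch. This is not the released code. The function names (`leave_one_site_out`, `auroc`, `notes_outside_horizon`), the toy sites, labels, and scores, and the approximation of 6 months as 183 days are all assumptions made for illustration. The sketch also assumes the exclusion drops notes dated within the horizon before diagnosis, which the description does not spell out.

```python
from datetime import date, timedelta
from statistics import mean


def leave_one_site_out(sites):
    """Yield (held_out_site, train_idx, test_idx), holding each site out once."""
    for held_out in sorted(set(sites)):
        train_idx = [i for i, s in enumerate(sites) if s != held_out]
        test_idx = [i for i, s in enumerate(sites) if s == held_out]
        yield held_out, train_idx, test_idx


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def notes_outside_horizon(notes, diagnosis_date, horizon_days=183):
    """Keep notes dated on or before the cutoff (diagnosis date minus the horizon).

    Assumption: 6 months is approximated as 183 days.
    """
    cutoff = diagnosis_date - timedelta(days=horizon_days)
    return [n for n in notes if n["date"] <= cutoff]


# Hypothetical toy data: two sites, four patients each.
sites = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.3, 0.65, 0.8]

per_site = {}
for site, train_idx, test_idx in leave_one_site_out(sites):
    # A real pipeline would fit a model on train_idx here; the toy scores
    # stand in for that model's predictions on the held-out site.
    per_site[site] = auroc([labels[i] for i in test_idx],
                           [scores[i] for i in test_idx])

mean_auroc = mean(per_site.values())

# Hypothetical notes for one patient diagnosed on 2020-12-01.
notes = [{"date": date(2020, 1, 1)}, {"date": date(2020, 9, 1)}]
kept = notes_outside_horizon(notes, date(2020, 12, 1))
```

Holding out a whole site, rather than random patients, means each test fold comes from an institution the model never saw during training. That is the reason a per-site score and its mean are reported here.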
Provider:
BDSP
Created:
2026-03-03



