Performance ML models.
Source: NIAID Data Ecosystem (indexed 2026-05-10)
Download link:
https://figshare.com/articles/dataset/Performance_ML_models_/30559131
Description:
Background: Cardiovascular diseases (CVD) are among the leading causes of death worldwide, which makes accurate early prediction essential. This study aimed to develop transparent machine learning (ML) models using National Health and Nutrition Examination Survey (NHANES) data from 2017–2023 to predict CVD risk based on dietary and health factors.
Methods: We analyzed data from 12,382 adults (aged 18 and older) from NHANES 2017–2023, including 41 dietary, anthropometric, clinical, and demographic variables. Recursive Feature Elimination (RFE) was used to select an optimal subset of 30 predictors. To address substantial class imbalance in the outcome, we applied the Random Over-Sampling Examples (ROSE) technique to the training data only. Five machine learning models (Logistic Regression, Random Forest, Support Vector Machines, XGBoost, and LightGBM) were trained and evaluated. Model interpretability was assessed using LIME and SHAP.
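The resampling step in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: ROSE is an R package that generates synthetic examples, whereas the hypothetical `oversample_minority` helper below simply duplicates randomly chosen minority rows. The toy data, the 80/20 split, and all function names are assumptions made for illustration. The point it shows is that balancing is applied to the training split only, so the held-out data keeps its original class distribution.

```python
import random
from collections import Counter

def train_test_split_idx(n, test_frac=0.2, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = int(n * test_frac)
    return idx[n_test:], idx[:n_test]

def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes
    match the largest class (a plain stand-in for ROSE, which instead
    draws synthetic examples)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data (assumption): 100 rows, 10% positive class.
labels = [1 if i % 10 == 0 else 0 for i in range(100)]
rows = [[float(i)] for i in range(100)]

train_idx, test_idx = train_test_split_idx(len(rows))
X_train = [rows[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]

# Balance the training split only; the test split is left untouched.
X_bal, y_bal = oversample_minority(X_train, y_train)
balanced_counts = Counter(y_bal)
```

Resampling before the split would let copies of the same row land in both training and test data, which tends to inflate the reported performance.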
Results: Participants with CVD differed significantly from those without CVD in age, waist circumference, systolic blood pressure, C-reactive protein (CRP), and multiple dietary nutrients, with consistently lower nutrient intake in the CVD group. Among the ML models evaluated, XGBoost achieved the highest accuracy (0.8216) and recall (0.8645), while Random Forest showed the highest AUROC (0.8139). Interpretability analyses identified age as the strongest predictor, followed by vitamin B12, total cholesterol, CRP, and waist circumference.
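For readers unfamiliar with the three metrics reported above, the following sketch shows how accuracy, recall, and AUROC can be computed from labels and predicted scores. The labels, scores, and the 0.5 decision threshold are made-up illustrative values, not the study's data, and the function names are hypothetical. AUROC is computed here as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of true positive cases that the model flags as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

def auroc(y_true, scores, positive=1):
    """Probability that a random positive case outscores a random negative
    one (ties count 0.5). Pairwise O(n_pos * n_neg), fine for small data."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative values only (assumption), not results from the dataset.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)   # 4 of 6 correct
rec = recall(y_true, y_pred)     # 2 of 3 positives found
auc = auroc(y_true, scores)      # 7 of 9 pairs ranked correctly
```

Accuracy and recall depend on the chosen threshold, while AUROC depends only on how the scores rank the cases, which is one reason different models can lead on different metrics.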
Conclusion: Interpretable ML models effectively identified key dietary and clinical factors for CVD risk. Nutrients such as vitamin B12 and niacin, alongside established clinical indicators, emerged as significant predictors, underscoring their potential role in nutritional interventions and public health strategies for CVD prevention.
Created:
2025-11-06



