Table 1_An explainable machine learning model for predicting chronic coronary disease and identifying valuable text features.xlsx

NIAID Data Ecosystem, indexed 2026-05-10

Download link:

https://figshare.com/articles/dataset/Table_1_An_explainable_machine_learning_model_for_predicting_chronic_coronary_disease_and_identifying_valuable_text_features_xlsx/30176449

Description:

Background: Chronic Coronary Disease (CCD) is a leading global cause of morbidity and mortality. Existing Pre-test Probability (PTP) models mainly rely on in-hospital data and clinician judgment. This study aims to construct machine learning (ML) models for predicting CCD by using easily accessible text data and baseline characteristics, and to evaluate the contribution of text data to the diagnostic model.

Methods: The chief complaints, present illness, past medical history and vital signs of the patients from the internal medicine departments of the First Affiliated Hospital and the Second Affiliated Hospital of Wannan Medical College were gathered. The text data of the research subjects were structured by using text mining technology. A customized “stop words” list and “custom dictionary” for cardiovascular medicine were created to optimize the processing of text data. Then, ML algorithms were employed to establish CCD prediction models. Finally, the Shapley additive explanation (SHAP) algorithm was used to interpret the models.

Results: We enrolled a total of 21,855 patients in this study, with 7,449 in the CCD group and 14,406 in the non-CCD group. Patients in the CCD group were generally older and had a higher male proportion. After conducting feature engineering, we successfully constructed a Random Forest model. The model achieved an area under the ROC curve (AUC) of 0.93 (95% CI, 0.93–0.94), demonstrating excellent performance in horizontal comparisons. Using the SHAP algorithm, valuable text features like “chest pain”, “chest tightness” and structured features such as age, which are crucial for CCD judgment, were identified. Additionally, an illustration of how these features influenced the model's decision-making process was provided.

Conclusion: Clinicians can leverage text data to construct a prediction model for CCD and apply the SHAP approach to pinpoint valuable text features and elucidate the model's decision-making mechanism.
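The text-structuring step mentioned in the Methods (a custom dictionary plus a stop-word list) can be illustrated with a minimal Python sketch. The description does not name the tokenizer or list the actual dictionary entries, so everything here is an assumption for illustration only: the English sample note, the entries in `CUSTOM_DICTIONARY` and `STOP_WORDS`, and the helper names `tokenize` and `structure_text` are all hypothetical.

```python
from collections import Counter

# Hypothetical vocabularies; the study's real lists are not given in the description.
CUSTOM_DICTIONARY = {"chest pain", "chest tightness", "shortness of breath"}
STOP_WORDS = {"the", "a", "of", "for", "and", "with", "patient", "reports", "days"}


def tokenize(text, dictionary=CUSTOM_DICTIONARY, max_phrase_len=3):
    """Greedy longest-match tokenization so multi-word dictionary terms stay whole."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest phrase first, falling back to a single word.
        for n in range(min(max_phrase_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


def structure_text(text, vocabulary, stop_words=STOP_WORDS):
    """Turn one free-text note into a fixed-length vector of term counts."""
    counts = Counter(t for t in tokenize(text) if t not in stop_words)
    return [counts[term] for term in vocabulary]


# Hypothetical feature vocabulary and sample note.
VOCABULARY = ["chest pain", "chest tightness", "shortness of breath", "dizziness"]
note = "Patient reports chest pain and chest tightness for 3 days."
features = structure_text(note, VOCABULARY)
print(features)
```

Vectors of this kind could then be combined with structured variables such as age before model fitting. The dictionary is matched before stop words are removed, so a phrase like "shortness of breath" survives even though "of" is on the stop-word list.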

Created:

2025-09-22
