Table_4_T4SE-XGB: Interpretable Sequence-Based Prediction of Type IV Secreted Effectors Using eXtreme Gradient Boosting Algorithm.csv
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://figshare.com/articles/dataset/Table_4_T4SE-XGB_Interpretable_Sequence-Based_Prediction_of_Type_IV_Secreted_Effectors_Using_eXtreme_Gradient_Boosting_Algorithm_csv/12997793
Description:
Type IV secreted effectors (T4SEs) can be translocated into the cytosol of host cells via the type IV secretion system (T4SS) and cause disease. However, experimental approaches to identifying T4SEs are time- and resource-consuming, and existing machine learning-based computational tools have clear limitations, such as the lack of interpretability of their prediction models. In this study, we propose a new model, T4SE-XGB, which uses the eXtreme gradient boosting (XGBoost) algorithm to identify type IV effectors accurately from an optimal set of features derived from protein sequences. After evaluating 20 different types of features, we found that, under 5-fold cross-validation, XGBoost trained on all features achieved the best performance compared with other machine learning methods. The ReliefF algorithm was then used to select the optimal feature set on our dataset, which further improved model performance. T4SE-XGB exhibited the highest predictive performance on the independent test set and outperformed other published prediction tools. Furthermore, the SHAP method was used to interpret the contribution of individual features to model predictions. Identifying key features can improve understanding of the multifactorial contributors to host-pathogen interactions and bacterial pathogenesis. Beyond type IV effector prediction, we believe the proposed framework can guide similar studies that construct prediction methods for related biological problems. The data and source code of this study can be freely accessed at https://github.com/CT001002/T4SE-XGB.
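The description above mentions sequence-derived features and 5-fold cross-validation. As a rough illustration only, the following standard-library Python sketch computes one simple sequence feature and produces 5-fold index splits. It is not the authors' code: amino acid composition is assumed here as an example feature and is not confirmed to be one of the 20 feature types in the study, and the helper names `amino_acid_composition` and `k_fold_indices` are hypothetical. The actual pipeline (ReliefF selection, XGBoost training, SHAP interpretation) is in the GitHub repository linked above.

```python
from collections import Counter

# The 20 standard amino acid one-letter codes, in a fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Illustrative feature only (an assumption, not taken from the study).
    Non-standard characters are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds.

    No shuffling or stratification is done here; a real evaluation would
    usually shuffle and stratify by class label.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    start = 0
    for i in range(k):
        # The first (n_samples % k) folds receive one extra sample.
        size = n_samples // k + (1 if i < n_samples % k else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


if __name__ == "__main__":
    features = amino_acid_composition("MKTAYIAKQR")  # toy sequence, not real data
    print(len(features))
    for train_idx, test_idx in k_fold_indices(12, k=5):
        print(len(train_idx), len(test_idx))
```

In each round, a model would be fitted on the rows in `train_idx` and scored on the rows in `test_idx`, with the five scores averaged.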
Created:
2020-09-24



