Data Sheet 2_Research on the optimization model of anti-breast cancer candidate drugs based on machine learning.zip

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://figshare.com/articles/dataset/Data_Sheet_2_Research_on_the_optimization_model_of_anti-breast_cancer_candidate_drugs_based_on_machine_learning_zip/28767686

Description:

Breast cancer is one of the most common malignancies among women globally, and its incidence continues to rise, posing a serious threat to women’s health. Although current treatments, such as drugs targeting estrogen receptor alpha (ERα), have extended patient survival, drug resistance and severe side effects remain widespread. This study proposes a machine learning-based optimization model for anti-breast cancer candidate drugs, aimed at enhancing biological activity and optimizing ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties through multi-objective optimization. First, grey relational analysis and Spearman correlation analysis were performed on the molecular descriptors of 1,974 compounds, identifying 91 key descriptors. A Random Forest model combined with Shapley Additive Explanations (SHAP) values was then used to select the 20 descriptors with the greatest impact on biological activity. The resulting Quantitative Structure-Activity Relationship (QSAR) model, built with algorithms such as LightGBM, Random Forest, and XGBoost, achieved an R² of 0.743 for biological activity prediction, demonstrating strong predictive performance. Additionally, a multi-model fusion strategy and the Particle Swarm Optimization (PSO) algorithm were employed to optimize both biological activity and ADMET properties, thereby improving the prediction of the Caco-2, CYP3A4, hERG, HOB, and MN properties. For example, the best model for predicting Caco-2 achieved an F1 score of 0.8905, while the model for predicting CYP3A4 reached an F1 score of 0.9733. This multi-objective optimization model provides a novel and efficient tool for drug development, offering significant improvements in both biological activity and pharmacokinetic properties, with practical implications for the optimization of future anti-breast cancer drugs.
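
The descriptor-screening step mentioned above can be illustrated with a minimal, standard-library sketch. It assumes the common Deng formulation of the grey relational grade (min-max normalization, distinguishing coefficient 0.5) and Spearman correlation computed as Pearson correlation on average ranks. The function names and the toy data are hypothetical, and the thresholds and preprocessing actually used for this dataset are not stated in the description.

```python
from statistics import mean

def min_max(values):
    """Scale a series to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def grey_relational_grades(target, descriptors, rho=0.5):
    """Grey relational grade of each descriptor column against the target.

    rho is the distinguishing coefficient (0.5 is an assumed, conventional choice).
    """
    ref = min_max(target)
    deltas = {
        name: [abs(r - c) for r, c in zip(ref, min_max(col))]
        for name, col in descriptors.items()
    }
    all_deltas = [d for col in deltas.values() for d in col]
    d_min, d_max = min(all_deltas), max(all_deltas)
    if d_max == 0:
        # Every descriptor matches the target exactly after normalization.
        return {name: 1.0 for name in deltas}
    return {
        name: mean([(d_min + rho * d_max) / (d + rho * d_max) for d in col])
        for name, col in deltas.items()
    }

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / (sx * sy)

# Toy example: hypothetical activity values and two hypothetical descriptors.
activity = [1.0, 2.0, 3.0]
columns = {"desc_a": [2.0, 4.0, 6.0], "desc_b": [3.0, 1.0, 2.0]}
grades = grey_relational_grades(activity, columns)
ranked = sorted(grades, key=grades.get, reverse=True)
```

In a screening workflow of this kind, descriptors would typically be kept when both statistics exceed chosen cutoffs; those cutoffs are a modelling decision and are not given here.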

Created:

2025-04-10