“Big Data” Fast Chemoinformatics Model to Predict Generalized Born Radius and Solvent Accessibility as a Function of Geometry

Source: Figshare. Updated 2020-05-06. Indexed 2026-04-28.

Download link:

https://figshare.com/articles/dataset/_Big_Data_Fast_Chemoinformatics_Model_to_Predict_Generalized_Born_Radius_and_Solvent_Accessibility_as_a_Function_of_Geometry/12328394

Description:

The Generalized Born (GB) solvent model offers the best ratio of accuracy to computing effort, yet it requires drastic simplifications to estimate the Effective Born Radii (EBR) while bypassing an overly expensive volume integration step. EBRs measure the degree of burial of an atom and are not very sensitive to small changes in geometry: in molecular dynamics, the costly EBR update procedure is not mandatory at every step. This work, however, aims to implement a GB model in the Sampler for Multiple Protein–Ligand Entities (S4MPLE) evolutionary algorithm, in which every step may trigger arbitrarily large geometric changes and EBR updates are therefore mandatory at each step. To that end, a quantitative structure–property relationship has been developed to express the EBRs as a linear function of both the topological neighborhood and the geometric occupancy of the space around atoms. A training set of 810 molecular systems has been compiled, ranging from fragment-like and drug-like compounds to proteins, host–guest systems, and ligand–protein complexes. For each species, S4MPLE generated several hundred random conformers. For each atom in each geometry of each species, its “standard” EBR was calculated by numeric integration and associated with topological and geometric descriptors of the atom neighborhood. This training set of (EBR, atom descriptors) pairs, comprising >5 M entries, was subjected to a bootstrapping multilinear regression process with descriptor selection. In parallel, the strategy was repurposed to learn atomic solvent-accessible areas (SA) from the same descriptors. The resulting linear equations were challenged to predict EBR and SA values for a similarly compiled external set of >2000 new molecular systems. Solvation energies calculated with the estimated EBR and SA match the “standard” energies within the typical error of a force-field-based approach (a few kilocalories per mole). Given the extreme diversity of molecular systems covered by the model, this simple EBR/SA estimator covers a vast applicability domain.
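
The bootstrapping multilinear regression with descriptor selection mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bootstrap_select_and_fit`, the synthetic descriptor matrix, and the coefficient-stability threshold `z_min` are all assumptions made for illustration, and the selection rule (keep descriptors whose bootstrap coefficient is large relative to its spread) is only one plausible choice.

```python
import numpy as np


def bootstrap_select_and_fit(X, y, n_boot=200, z_min=5.0, seed=0):
    """Hypothetical sketch: bootstrap OLS fits, keep stable descriptors, refit.

    X : (n_samples, n_descriptors) array of atom descriptors (assumed layout)
    y : (n_samples,) array of target values, e.g. reference EBRs
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])  # intercept column first

    # Fit ordinary least squares on n_boot resamples drawn with replacement.
    coefs = np.empty((n_boot, p + 1))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        coefs[b] = np.linalg.lstsq(A[idx], y[idx], rcond=None)[0]

    # Stability score per descriptor: |mean coefficient| / bootstrap std.
    mean = coefs.mean(axis=0)
    std = coefs.std(axis=0, ddof=1)
    z = np.abs(mean[1:]) / std[1:]
    selected = np.flatnonzero(z >= z_min)

    # Final linear model on the full data, restricted to selected descriptors.
    A_sel = np.column_stack([np.ones(n), X[:, selected]])
    final = np.linalg.lstsq(A_sel, y, rcond=None)[0]
    return selected, final  # final[0] is the intercept


# Synthetic demo data (not from the dataset): only descriptors 0 and 2 matter.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 6))
y_demo = (1.5 + 0.8 * X_demo[:, 0] - 0.5 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=2000))

selected, coef = bootstrap_select_and_fit(X_demo, y_demo)
print("selected descriptors:", selected)
print("intercept and coefficients:", coef)
```

Because the final model is linear, evaluating it for a new geometry reduces to a dot product over the selected descriptors, which is what makes this kind of estimator cheap enough to call at every sampling step.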

Created:

2020-05-06