Toward Reliable Conformational Energies of Amino Acids and Dipeptides: The DipCONFS Benchmark and DipCONL Datasets
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://figshare.com/articles/dataset/Toward_Reliable_Conformational_Energies_of_Amino_Acids_and_Dipeptides_The_DipCONFS_Benchmark_and_DipCONL_Datasets/26997574
Description:
Simulating peptides and proteins is becoming increasingly important, leading to a growing need for efficient computational methods. These are typically semiempirical quantum mechanical (SQM) methods, force fields (FFs), or machine-learned interatomic potentials (MLIPs), all of which require a large amount of accurate data for robust training and evaluation.
To assess potential reference methods and complement the available data, we introduce two sets, DipCONFL and DipCONFS, which cover large parts of the conformational space of 17 amino acids and their 289 possible dipeptides in aqueous solution. The conformers were selected from the exhaustive PeptideCS dataset by Andris et al. [J. Phys. Chem. B 2022, 126, 5949–5958]. The structures, originally generated with GFN2-xTB, were reoptimized using the accurate r2SCAN-3c density functional theory (DFT) composite method including the implicit CPCM water solvation model.
The DipCONFS benchmark set contains 918 conformers and is one of the largest sets with highly accurate coupled cluster conformational energies to date. It is employed to evaluate various DFT and wave function theory (WFT) methods, especially regarding whether they are accurate enough to serve as reliable reference methods for larger datasets intended for training and testing more approximate SQM, FF, and MLIP methods. The results reveal that the originally provided BP86-D3(BJ)/DGauss-DZVP conformational energies are not sufficiently accurate. Among the DFT methods tested as an alternative reference level, the revDSD-PBEP86-D4 double hybrid performs best, with a mean absolute deviation (MAD) of 0.2 kcal mol⁻¹ from the PNO-LCCSD(T)-F12b reference. The very efficient r2SCAN-3c composite method also shows excellent results, with an MAD of 0.3 kcal mol⁻¹, similar to the best-tested hybrid, ωB97M-D4.
With these findings, we compiled the large DipCONFL set, which includes over 29,000 realistic conformers in solution with reasonably accurate r2SCAN-3c reference conformational energies, gradients, and further properties potentially relevant for training MLIP methods. This set, also in comparison to DipCONFS, is used to robustly assess the performance of various SQM, FF, and MLIP methods and can complement training sets for those.
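The MAD values quoted above compare conformational energies from a tested method with those from a reference level. As a minimal sketch of how such a number could be computed, the Python snippet below takes energies relative to the conformer that is lowest at the reference level and averages the absolute deviations over the remaining conformers. The function names, the choice of zero point, the exclusion of the zero-point conformer from the average, and the energy values are all illustrative assumptions; they are not taken from the dataset or from the authors' workflow.

```python
from statistics import mean


def relative_energies(energies, zero_index):
    """Shift a list of energies so the conformer at zero_index is 0."""
    e0 = energies[zero_index]
    return [e - e0 for e in energies]


def conformational_mad(method, reference):
    """Mean absolute deviation of relative conformer energies.

    Both lists hold total energies of the same conformers, in the same
    order and units. Assumption: the zero point is the conformer lowest
    at the reference level, and it is left out of the average because
    its deviation is zero by construction.
    """
    if len(method) != len(reference):
        raise ValueError("method and reference must have the same length")
    if len(reference) < 2:
        raise ValueError("need at least two conformers")
    zero = min(range(len(reference)), key=reference.__getitem__)
    rel_method = relative_energies(method, zero)
    rel_reference = relative_energies(reference, zero)
    return mean(
        abs(m - r)
        for i, (m, r) in enumerate(zip(rel_method, rel_reference))
        if i != zero
    )


# Hypothetical total energies in kcal/mol for four conformers of one system.
reference = [-100.0, -99.2, -98.5, -97.9]
method = [-250.0, -249.5, -248.3, -247.6]

print(f"MAD = {conformational_mad(method, reference):.2f} kcal/mol")
```

Only energy differences enter the result, so the large offset between the two sets of total energies drops out. Other conventions, such as averaging over all conformer pairs, would give different numbers for the same data.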
Created:
2024-09-11



