Resolving Transition Metal Chemical Space: Feature Selection for Machine Learning and Structure–Property Relationships

Indexed from NIAID Data Ecosystem on 2026-03-10
Download link:
https://figshare.com/articles/dataset/Resolving_Transition_Metal_Chemical_Space_Feature_Selection_for_Machine_Learning_and_Structure_Property_Relationships/5603020
Description:
Machine learning (ML) of quantum mechanical properties shows promise for accelerating chemical discovery. For transition metal chemistry, where accurate calculations are computationally costly and available training data sets are small, the molecular representation becomes a critical ingredient in ML model predictive accuracy. We introduce a series of revised autocorrelation functions (RACs) that encode relationships among heuristic atomic properties (e.g., size, connectivity, and electronegativity) on a molecular graph. We alter the starting point, scope, and nature of the quantities evaluated in standard autocorrelation functions (ACs) to make these RACs amenable to inorganic chemistry. On an organic molecule set, we first demonstrate that standard ACs outperform other presently available topological descriptors for ML model training, with mean unsigned errors (MUEs) for atomization energies on set-aside test molecules as low as 6 kcal/mol. For inorganic chemistry, our RACs yield 1 kcal/mol ML MUEs on set-aside test molecules in spin-state splitting, in comparison to 15–20× higher errors for feature sets that encode whole-molecule structural information. Systematic feature selection methods, including univariate filtering, recursive feature elimination, and direct optimization (e.g., random forest and LASSO), are compared. Random-forest- or LASSO-selected subsets 4–5× smaller than the full RAC set produce sub- to 1 kcal/mol spin-splitting MUEs, with good transferability to metal–ligand bond length prediction (0.004–0.005 Å MUE) and to redox potential on a smaller data set (0.2–0.3 eV MUE). Evaluation of feature selection results across property sets reveals the relative importance of local, electronic descriptors (e.g., electronegativity, atomic number) in spin splitting and of distal, steric effects in redox potential and bond lengths.
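
To make the idea of an autocorrelation descriptor on a molecular graph concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes the common convention that the depth-d autocorrelation of an atomic property is the sum of products of that property over all ordered atom pairs separated by exactly d bonds. The function names, the adjacency-list input, and the `starts` argument are hypothetical. `starts` only illustrates how one might restrict the starting point to a chosen atom such as a metal center, and is not the RAC definition from the paper.

```python
from collections import deque


def graph_distances(adjacency, source):
    """Breadth-first search: number of bonds from `source` to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def autocorrelation(adjacency, prop, max_depth, starts=None):
    """Return [AC_0, ..., AC_max_depth] for one atomic property.

    AC_d sums prop[i] * prop[j] over ordered pairs (i, j) that are exactly
    d bonds apart (assumed convention). `starts` (hypothetical) limits the
    atoms i the sum begins from; None means every atom.
    """
    if starts is None:
        starts = range(len(adjacency))
    ac = [0.0] * (max_depth + 1)
    for i in starts:
        for j, d in graph_distances(adjacency, i).items():
            if d <= max_depth:
                ac[d] += prop[i] * prop[j]
    return ac


# Toy graph: water, with atom 0 = O bonded to atoms 1 and 2 = H.
adjacency = [[1, 2], [0], [0]]
atomic_number = [8, 1, 1]

full_ac = autocorrelation(adjacency, atomic_number, max_depth=2)
centered_ac = autocorrelation(adjacency, atomic_number, max_depth=2, starts=[0])
print(full_ac)      # [66.0, 32.0, 2.0]
print(centered_ac)  # [64.0, 16.0, 0.0]
```

Swapping `atomic_number` for another per-atom list (for example electronegativity or covalent radius) gives one feature per property and depth. A set of such features is the kind of input that the feature selection methods named above would then prune.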
Created:
2017-11-15