Benchmarking Machine Learning Models for Polymer Informatics: An Example of Glass Transition Temperature

Source: Figshare. Updated 2021-10-18; indexed 2026-04-28.

Download link:

https://figshare.com/articles/dataset/Benchmarking_Machine_Learning_Models_for_Polymer_Informatics_An_Example_of_Glass_Transition_Temperature/16828463

Description:

In polymer informatics, using machine learning (ML) to evaluate the glass transition temperature (Tg) and other polymer properties has attracted extensive attention. This data-centric approach is far more efficient and practical than laborious experimental measurement when one faces a daunting number of polymer structures. Various ML models have been shown to perform well for Tg prediction. However, they were trained on different data sets, with different structure representations and different feature engineering methods. The critical question is therefore how to select an ML model that predicts Tg well and also generalizes. To compare ML techniques fairly and examine the key factors that affect model performance, we carry out a systematic benchmark study, compiling 79 different ML models and training them on a large and diverse data set. The three major components in setting up an ML model are the structure representation, the feature representation, and the ML algorithm. For structure representation, we consider the polymer monomer, the repeat unit, and an oligomer with a longer chain structure. From the chosen structure, a feature representation is calculated: Morgan fingerprints with or without substructure frequency, RDKit descriptors, molecular embeddings, molecular graphs, etc. The resulting features are then used to train different ML algorithms, such as deep neural networks, convolutional neural networks, random forest, support vector machine, LASSO regression, and Gaussian process regression. We evaluate these ML models on a holdout test set and on an extra unlabeled data set from high-throughput molecular dynamics simulation. Particular attention is paid to the models' generalization ability on the unlabeled data set, and their sensitivity to the topology and molecular weight of the polymers is also considered. This benchmark study provides not only a guideline for the Tg prediction task but also a useful reference for other polymer informatics tasks.
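
The three-component setup described in the abstract (structure representation, feature representation, ML algorithm) together with holdout evaluation might be sketched as follows. This is a minimal illustration in standard-library Python, not the study's code. The option lists are a subset taken from the abstract, the full cross product shown here is not how the 79 benchmarked models were chosen, and the helper names (`enumerate_configs`, `holdout_split`) and the metrics RMSE and R^2 are assumptions, since the abstract names no metrics.

```python
import itertools
import math
import random

# Option names come from the abstract; the lists are illustrative only and
# their cross product is NOT the set of 79 models used in the study.
STRUCTURES = ["monomer", "repeat_unit", "oligomer"]
FEATURES = ["morgan_fingerprint", "morgan_fingerprint_with_frequency", "rdkit_descriptors"]
ALGORITHMS = ["random_forest", "support_vector_machine", "lasso", "gaussian_process"]


def enumerate_configs():
    """Return every (structure, feature, algorithm) combination as a dict."""
    return [
        {"structure": s, "feature": f, "algorithm": a}
        for s, f, a in itertools.product(STRUCTURES, FEATURES, ALGORITHMS)
    ]


def holdout_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination; 1.0 means a perfect prediction."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

In a real benchmark, each configuration would be trained on the training part of the split and scored with the same metrics on the same holdout part, so that differences in score reflect the configuration and not the data.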

Created:

2021-10-18
