Effect of Molecular Descriptors on the Development of Machine Learning Models for the Prediction of Yield Sooting Index

Source: Figshare (indexed 2026-04-28)

Download link:

https://figshare.com/articles/dataset/Effect_of_Molecular_Descriptors_on_the_Development_of_Machine_Learning_Models_for_the_Prediction_of_Yield_Sooting_Index/30598230

Description:

Sooting propensity is a critical property for estimating the combustion efficiency and pollutant emissions of a fuel, and for discovering the next generation of cleaner and more efficient fuels. The yield sooting index (YSI) is an important metric for characterizing sooting propensity; however, measuring it experimentally is inefficient. Machine learning (ML)-based predictive models are therefore an important tool for predicting the YSI in fuel design. This work compares the accuracy and interpretability of four ML models that predict the YSI from different kinds of descriptors. The best-performing model depends on the descriptor set: the multilayer perceptron (MLP) regressor neural network (NN), gradient boosting (GB), and random forest (RF) models are the best models for the PaDEL, mordred, and quantum mechanical (QM) descriptors, respectively. The NN model is suitable for the combination of QM descriptors with the full PaDEL and mordred descriptor sets, while the RF model is better for the combination of QM descriptors with PaDEL and mordred descriptors after filtering by permutation feature importance (PFI). Using QM descriptors slightly improves the performance of the deep-learning-based ML model. All of the developed ML models predict the YSI with high accuracy: the coefficient of determination (R²) between experimental and predicted values is close to 1.0, and the mean absolute error is less than 20, for each of the training, validation, and test sets. Among the developed ML models, the GB model using the PFI-filtered mordred descriptors exhibits the best performance. The present work is valuable for selecting descriptors when developing ML models to predict fuel properties.
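The description above relies on three quantities: R², mean absolute error (MAE), and permutation feature importance (PFI). The following is a minimal, standard-library-only sketch of how they can be computed. It is not the authors' code, which is not shown in this record. The toy data, the `toy_model` function, and the "keep descriptors with positive importance" filtering rule are all hypothetical, chosen only to illustrate the idea.

```python
import random

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MAE when one descriptor column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between descriptor j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            mae = mean_absolute_error(y, [predict(row) for row in X_perm])
            increases.append(mae - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical stand-in for a trained regressor: it uses descriptor 0
# and ignores descriptor 1 entirely.
def toy_model(row):
    return 10.0 * row[0] + 5.0

# Hypothetical toy data: two descriptors per molecule, with a fixed
# residual of +/-1 between the "measured" target and the model output.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [toy_model(row) + (1.0 if i % 2 else -1.0) for i, row in enumerate(X)]

y_pred = [toy_model(row) for row in X]
r2 = r2_score(y, y_pred)
mae = mean_absolute_error(y, y_pred)
importances = permutation_importance(toy_model, X, y)

# Assumed filtering rule: keep descriptors whose shuffling hurts the model.
keep = [j for j, imp in enumerate(importances) if imp > 0.0]
print(f"R2={r2:.4f}  MAE={mae:.2f}  kept descriptors={keep}")
```

In this toy case the ignored descriptor has an importance of exactly zero and is dropped. With a real model and thousands of PaDEL or mordred descriptors, a library implementation and a tuned threshold would likely be preferable.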
