Development of a machine learning-based ionization efficiency prediction model for per- and polyfluoroalkyl substances and its application in semi-quantitative analysis

China Scientific Data (中国科学数据); updated 2026-04-09; indexed 2026-04-25

Download link:

https://www.sciengine.com/AA/doi/10.3724/SP.J.1123.2025.02012


Resource description:

Per- and polyfluoroalkyl substances (PFASs) are emerging contaminants of global concern in fields such as environmental science and food safety because of their persistence, bioaccumulation, and potential toxicity. Although screening methods for PFASs based on high-resolution mass spectrometry (HRMS) have developed rapidly, the diversity of PFASs and the absence of reference standards pose significant challenges for quantitative analysis. In this study, 50 PFASs were analyzed by HPLC-HRMS, and the ionization efficiency (IE) of each compound was calculated as the slope of its calibration curve. A quantitative structure-activity relationship (QSAR) model was then developed with machine learning to predict the ionization efficiencies of PFASs from PaDEL molecular descriptors. By incorporating the predicted IE values, the model enables semi-quantitative estimation of PFAS concentrations in the absence of reference standards.

Eighteen critical descriptors were selected from a total of 1,444 PaDEL descriptors by recursive feature elimination (RFE). The selected descriptors comprised topological, geometrical, autocorrelation, electrostatic, and polarity descriptors. Among them, VE1_Dzv, GATS6i, JGI10, GATS1p, and MATS4m were particularly important.

Three algorithms, elastic net linear regression, random forest (RF), and XGBoost, were evaluated for model performance. In the elastic net linear regression model, the root mean square error (RMSE) for the training dataset was 0.0490 and the coefficient of determination (R²) was 0.9930; for the test dataset, the RMSE was 0.1630, with an R² of 0.7561. In the RF model, the RMSE for the training dataset was 0.1631 and the R² was 0.9219; for the test dataset, the RMSE was 0.1316, with an R² of 0.8409.
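The definition of IE as the slope of a calibration curve can be illustrated with a short sketch. This is only a minimal example, assuming an ordinary least-squares fit of peak area against concentration; the abstract does not state the fitting procedure, weighting, or units, and the function name `calibration_slope` and the calibration points below are hypothetical.

```python
def calibration_slope(concentrations, responses):
    """Ordinary least-squares slope of response versus concentration.

    Assumption: an unweighted linear fit with a free intercept. The study
    may have used a different fitting scheme.
    """
    n = len(concentrations)
    if n < 2 or n != len(responses):
        raise ValueError("need at least two paired calibration points")
    mean_x = sum(concentrations) / n
    mean_y = sum(responses) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    if sxx == 0:
        raise ValueError("concentrations must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, responses))
    return sxy / sxx


# Hypothetical calibration points (ng/mL versus peak area); not study data.
conc = [1.0, 2.0, 5.0, 10.0, 20.0]
area = [2.0e4 * c + 500.0 for c in conc]

# The slope of this synthetic curve plays the role of the IE.
ie = calibration_slope(conc, area)
```

Under these assumptions, a compound with a steeper calibration curve gives more signal per unit concentration, which is the quantity the QSAR model is trained to predict.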
In the XGBoost model, the RMSE for the training dataset was 0.0521 and the R² was 0.9920; for the test dataset, the RMSE was 0.1184, with an R² of 0.8713. The nonlinear algorithms, random forest and XGBoost, showed better predictive performance than elastic net linear regression, with XGBoost performing best.

Random forest, a bagging-based approach, trains individual decision trees independently and aggregates their predictions by averaging. In contrast, XGBoost uses gradient boosting, iteratively optimizing the model by training new trees in sequence to address the residuals from previous iterations. The independent training mechanism of random forest lacks the iterative optimization framework that is characteristic of gradient boosting: each new XGBoost tree targets the residual errors of the preceding model, progressively refining predictive performance. This fundamental difference in optimization strategy enables XGBoost to correct prediction errors more effectively than random forest. Based on a comprehensive evaluation of the three models, the XGBoost algorithm was ultimately selected for its demonstrated performance advantages.

The prediction errors of ionization efficiency (IE) for the 50 PFASs were within 1.67-fold, with a median value of 1.04-fold and an RMSE of 1.06. The established XGBoost model was further applied to the semi-quantitative concentration prediction of the 50 PFASs across concentration gradients, where the prediction errors ranged from 0.12-fold to 4.90-fold, with a median value of 0.96-fold and an RMSE of 0.94. Prediction accuracy improved as the concentrations increased. Furthermore, the model was applied to predict concentrations of PFASs in fish tissue. After sample extraction and cleanup by solid-phase extraction, the samples were analyzed by HPLC-HRMS.
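The contrast between bagging and boosting described above can be sketched with one-split regression trees (stumps) on one-dimensional toy data. This is an illustrative assumption-laden sketch, not the study's implementation: it uses plain gradient boosting with squared loss rather than XGBoost itself (which adds regularization and second-order information), and all function names and data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Best single-split stump (threshold, left mean, right mean) by squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: fall back to a constant prediction
        m = sum(ys) / len(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def fit_bagging(xs, ys, n_trees, rng):
    """Bagging: each stump is trained independently on a bootstrap sample."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagging(stumps, x):
    # Aggregate the independent trees by averaging.
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


def fit_boosting(xs, ys, n_trees, learning_rate):
    """Boosting: each stump is trained on the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps


def predict_boosting(model, x, learning_rate):
    base, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def train_rmse(predict, xs, ys):
    return (sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


# Hypothetical toy data; not descriptors or IE values from the study.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
LR = 0.5

one = fit_boosting(xs, ys, 1, LR)
many = fit_boosting(xs, ys, 20, LR)
rmse_one = train_rmse(lambda x: predict_boosting(one, x, LR), xs, ys)
rmse_many = train_rmse(lambda x: predict_boosting(many, x, LR), xs, ys)

bag = fit_bagging(xs, ys, 20, random.Random(0))
rmse_bag = train_rmse(lambda x: predict_bagging(bag, x), xs, ys)
```

In this sketch, adding boosting rounds lowers the training error because every new stump fits what the ensemble still gets wrong, whereas the bagged stumps never see one another's errors. The toy comparison says nothing about test-set performance, which is what the reported R² values measure.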
The concentrations of PFASs were semi-quantified using the predicted IEs, yielding prediction errors ranging from 0.79-fold to 1.81-fold. These findings highlight the robustness of the IE prediction model for PFASs. Notably, the performance of the developed model was better than or comparable to that reported in previous studies.

In conclusion, this study introduces a machine learning-based QSAR model for the prediction of ionization efficiency. The approach can estimate the concentrations of PFASs in the absence of standards, and therefore has considerable potential for the risk assessment of compounds lacking standards in suspect and non-targeted screening.
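The semi-quantification step and the fold-error figures can be sketched as follows. This is a minimal sketch under stated assumptions: the response is taken to be linear through the origin, so concentration is peak area divided by the predicted IE (the calibration slope), and "fold error" is taken to be the ratio of predicted to reference concentration, which is consistent with reported values below 1 but is not defined in the abstract. The function names and all numbers are hypothetical.

```python
import statistics


def semi_quantify(peak_area, predicted_ie):
    """Estimate concentration from a peak area and a predicted IE.

    Assumption: response = IE * concentration, with a negligible intercept.
    """
    if predicted_ie <= 0:
        raise ValueError("predicted IE must be positive")
    return peak_area / predicted_ie


def fold_errors(predicted, reference):
    """Ratio of predicted to reference concentration for each compound."""
    return [p / r for p, r in zip(predicted, reference)]


# Hypothetical values for illustration only; not data from the study.
areas = [4.0e5, 1.5e5, 9.0e4]          # measured peak areas
predicted_ies = [1.0e4, 1.0e4, 1.0e5]  # assumed model output, area per ng/mL
reference = [50.0, 12.0, 1.0]          # assumed known concentrations, ng/mL

estimated = [semi_quantify(a, ie) for a, ie in zip(areas, predicted_ies)]
ratios = fold_errors(estimated, reference)
median_ratio = statistics.median(ratios)
```

Under this reading, a ratio of 1 means the estimate matches the reference, values below 1 are underestimates, and values above 1 are overestimates.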

Created:

2026-04-09
