
Graph-based deep learning models for thermodynamic property prediction: The interplay between target definition, data distribution, featurization, and model architecture

Figshare | Updated 2024-10-30 | Indexed 2026-04-08
Download link:
https://figshare.com/articles/dataset/Graph-based_deep_learning_models_forthermodynamic_property_prediction_Theinterplay_between_target_definition_datadistribution_featurization_and_modelarchitecture/27262947/1
Description:
This folder contains the formation energies of the filtered BDE-db, QM9, PC9, QMugs, and QMugs1.1 datasets. The training, test, and validation sets were split randomly in a ratio of 0.8, 0.1, and 0.1, respectively. The filtering process is described below; the code can be found at https://github.com/chimie-paristech-CTM/thermo_GNN.
(1) For the QM9 dataset, a version with cleaned SMILES was readily available, in which 6,993 compounds from the original dataset were rejected. We adopted this version of the dataset.
(2) Extensive data cleaning had also been performed during the construction of the BDE-db dataset; consequently, no additional cleaning steps were applied to this set either.
(3) For the PC9 dataset, we noticed that a significant fraction of the compounds were assigned incorrect SMILES. To clean up this dataset, we started by performing four filtering steps. First, we rejected SMILES strings that could not be parsed by RDKit. Second, we removed all compounds for which the number of atoms in the RDKit mol object did not match the number of atoms in the .xyz file. Third, we removed all compounds for which the number of radical sites did not match the multiplicity assigned by the original authors of the dataset. Finally, we removed the diatomic molecules from the dataset. 86,384 compounds passed all of these filtering steps. For the rejected compounds, we tried to regenerate the SMILES strings from the .xyz files with the help of xyz2mol. This resulted in the recovery of 10,265 additional data points. As such, the final filtered version of the dataset contains 96,634 compounds, of which 5,078 are mono- and diradicals.
For QMugs and QMugs1.1, no issues with the quality of the SMILES strings were detected. Nevertheless, in QMugs1.1, 767 heavily charged compounds (a charge of +/-3 or more) and 12 mono- and diatomic molecules were filtered out, as we can reasonably expect none of our model architectures to work for these compounds.
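The random 0.8/0.1/0.1 split mentioned at the start of this description can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the code from the linked repository: the function name `random_split`, the `seed` parameter, and the choice to hand any rounding remainder to the last subset are all assumptions.

```python
import random


def random_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle `items` and cut them into three lists (hypothetical helper).

    The three fractions are applied in order; which subset is called
    validation and which is called test is up to the caller.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_first = int(fractions[0] * len(shuffled))
    n_second = int(fractions[1] * len(shuffled))
    first = shuffled[:n_first]
    second = shuffled[n_first:n_first + n_second]
    # Any remainder left over from integer truncation goes to the last subset.
    third = shuffled[n_first + n_second:]
    return first, second, third
```

Fixing the seed makes the partition reproducible, which matters when several model architectures are to be compared on identical subsets.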
Additionally, outliers were detected by regressing the computed energy values against the element counts (including explicit Hs) as independent variables. Data points whose residuals exceeded the upper quartile by more than 1.5 times the interquartile range (IQR) were subsequently removed, since such a discrepancy between the linear baseline and the actual value indicates an exceptionally high enthalpy for the given molecular composition, implying convergence to a particularly unstable conformation.
(4) For the QMugs dataset, the constructed linear model yielded an RMSE of 0.0567 Hartree and an MAE of 0.0428 Hartree.
(5) For the QMugs1.1 dataset, an RMSE of 0.1199 Hartree and an MAE of 0.0899 Hartree were obtained.
The scatter plots for each dataset in Figure 1 illustrate the relationship between the predicted values (Y-axis) and the corresponding absolute enthalpy values (X-axis). In total, our filtered QMugs dataset consists of 636,821 data points, and our filtered QMugs1.1 dataset consists of 70,546 data points.
After application of the above procedure, the final versions of QM9 (127,007 data points), BDE-db (289,639 data points), PC9 (96,634 data points), QMugs (636,821 data points), and QMugs1.1 (70,546 data points) were obtained and used throughout this study.
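The outlier criterion described above (a linear fit of energy against element counts, followed by an upper-fence IQR cut on the residuals) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The function names `fit_linear` and `upper_iqr_outliers` are hypothetical, and so is the absence of an intercept term. The quartile convention is also an assumption, namely the default of Python's `statistics.quantiles`.

```python
import statistics


def fit_linear(X, y):
    """Least-squares weights from the normal equations (X^T X) w = X^T y.

    X is a list of rows of element counts, y the matching energies.
    No intercept is fitted (an assumption of this sketch).
    """
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(n)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("element-count columns are linearly dependent")
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def upper_iqr_outliers(counts, energies, k=1.5):
    """Indices whose residual (actual - predicted) exceeds Q3 + k * IQR."""
    w = fit_linear(counts, energies)
    residuals = [
        e - sum(wi * ci for wi, ci in zip(w, row))
        for row, e in zip(counts, energies)
    ]
    q1, _, q3 = statistics.quantiles(residuals, n=4)
    threshold = q3 + k * (q3 - q1)
    return [i for i, r in enumerate(residuals) if r > threshold]
```

Only the upper fence is applied, matching the description: a large positive residual means the computed energy is much higher than the composition-based baseline predicts, whereas unusually low energies are left in place.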
Provided by:
DENG, Bowen; Stuyver, Thijs
Created:
2024-10-19