ΔvapHm-VOC: Standard Molar Vaporization Enthalpy Database for Machine Learning Prediction Models

Source: NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/11127879
Description:
We present the full database of the article "Data-Driven, Explainable Machine Learning Model for Predicting Volatile Organic Compounds’ Standard Vaporization Enthalpy". This database was used to build a data-driven, explainable supervised ML model that predicts ΔvapHm° of VOCs. The model was built on an established experimental database of 2410 unique molecules and 223 VOCs categorized by chemical groups. Among the supervised ML regression algorithms tested, the Random Forest predicted VOCs’ ΔvapHm° with a mean absolute error of 3.02 kJ mol⁻¹ and a 94% test score. The model was validated by predicting ΔvapHm° for a known database of VOCs and through molecular group hold-out tests. The model's database was built from molecules of diverse chemical families with known experimental ΔvapHm° values. Entries were collected from Acree and Chickos’ 2010 compilation, curated by Gharagheizi (2013), with experimental vaporization enthalpy at the standard temperature of 298.15 K. This database was selected because it is an open-access repository whose experimental values generally have low uncertainties and are corrected for the real-to-ideal behavior of the gas phase. We introduced a routine to convert each chemical entry into a SMILES string and to assign it a chemical family. For VOCs, we built a specific database of compounds documented in a VOC regulatory environmental guideline (Marlowe et al., 1995), and we used our web-scraping routine to gather experimental ΔvapHm° values. The external dataset for validation studies was also collected from Gharagheizi (2013). Along with its experimental ΔvapHm° value, each molecule is represented by its CAS number, SMILES string and InChIKey. We generated 106 chemical descriptors for every molecule in the database, using RDKit software version 2022.09.4, running on top of Python 3.9.
Descriptors were calculated from molecules parsed with the “MolFromSmiles” function in “rdkit.Chem”, and descriptors with non-numerical values were removed. The descriptors encode significant chemical information and represent the physicochemical characteristics of compounds, building a relationship between structure and ΔvapHm°. Through chemical feature importance analysis, the explainable model revealed that VOC polarizability, connectivity indexes and electrotopological state are key to the model’s prediction accuracy. We thus present a replicable and explainable model, which can be further expanded towards the prediction of other thermodynamic properties of VOCs.
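The descriptor step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' routine: the helper names `keep_numeric` and `compute_descriptors` are hypothetical, and because the description does not say which 106 descriptors were retained, the sketch simply evaluates everything in RDKit's `Descriptors.descList` and drops non-numerical results.

```python
import math
from numbers import Real


def keep_numeric(values):
    """Return only the entries whose value is a finite real number.

    Hypothetical helper: mirrors the 'remove non-numerical descriptors'
    step, assuming NaN and infinite values also count as non-numerical.
    """
    return {
        name: float(value)
        for name, value in values.items()
        if isinstance(value, Real)
        and not isinstance(value, bool)
        and math.isfinite(value)
    }


def compute_descriptors(smiles):
    """Compute RDKit descriptors for one SMILES string.

    Hypothetical helper: returns None when the SMILES cannot be parsed.
    """
    # Imported here so keep_numeric stays usable without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # MolFromSmiles returns None for invalid SMILES
        return None
    # Descriptors.descList is a list of (name, function) pairs.
    raw = {name: fn(mol) for name, fn in Descriptors.descList}
    return keep_numeric(raw)
```

With RDKit installed, `compute_descriptors("CCO")` would return a dictionary of descriptor names and float values for ethanol. The number of descriptors depends on the RDKit version and will generally differ from the 106 used in the database.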
Created:
2024-05-07