BioCompoundML: A General Biofuel Property Screening Tool for Biological Molecules Using Random Forest Classifiers
Source: NIAID Data Ecosystem (indexed 2026-03-09)
Download link:
https://figshare.com/articles/dataset/BioCompoundML_A_General_Biofuel_Property_Screening_Tool_for_Biological_Molecules_Using_Random_Forest_Classifiers/3971655
Description:
Screening a large number of biologically derived molecules for
potential fuel compounds without recourse to experimental testing
is important in identifying understudied yet valuable molecules. Experimental
testing, although a valuable standard for measuring fuel properties,
has several major limitations, including the requirement of testably
high quantities, considerable expense, and a large amount of time.
This paper discusses the development of a general-purpose fuel property
tool, using machine learning, whose outcome is to screen molecules
for desirable fuel properties. BioCompoundML adopts a general methodology,
requiring as input only a list of training compounds (with identifiers
and measured values) and a list of testing compounds (with identifiers).
For the training data, BioCompoundML collects open data from the National
Center for Biotechnology Information, incorporates user-provided features,
imputes missing values, performs feature reduction, builds a classifier,
and clusters compounds. BioCompoundML then collects data for the testing
compounds, predicts class membership, and determines whether compounds
are found in the range of variability of the training data set. This
tool is demonstrated using three different fuel properties: research
octane number (RON), threshold soot index (TSI), and melting point
(MP). We provide measures of its success with these properties using
randomized train/test measurements: average accuracy is 88% in RON,
85% in TSI, and 94% in MP; average precision is 88% in RON, 88% in
TSI, and 95% in MP; and average recall is 88% in RON, 82% in TSI,
and 97% in MP. The area under the receiver operating characteristic
curve was estimated at 0.88 in RON, 0.86 in TSI, and 0.87 in MP.
We also measured the success of BioCompoundML by sending 16 compounds
for direct RON determination. Finally, we provide a screen of 1977
hydrocarbons/oxygenates within the 8696 compounds in MetaCyc, identifying
compounds with high predictive strength for high or low RON.
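
The accuracy, precision, recall, and area-under-the-curve figures quoted above are standard binary-classification metrics. The following is a minimal sketch of how such numbers can be computed from class labels and classifier scores. It is not BioCompoundML's own code: the function names, the 0.5 decision threshold, and the toy labels and scores are all illustrative assumptions, with class 1 standing in for something like "high RON".

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive scores higher than a randomly chosen negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the paper): true classes and classifier scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

# Assumed decision threshold of 0.5 to turn scores into predicted classes.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # fraction of all compounds classified correctly
precision = tp / (tp + fp)           # fraction of predicted positives that are correct
recall = tp / (tp + fn)              # fraction of true positives that are recovered
auc = roc_auc(y_true, scores)        # threshold-independent ranking quality

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} auc={auc:.3f}")
```

Accuracy, precision, and recall depend on the chosen threshold, whereas the area under the curve summarizes ranking quality across all thresholds, which is why the two kinds of figure can differ for the same property.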
Created:
2016-10-14



