Systematic Approaches for the Encoding of Chemical Groups: A Case Study
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://figshare.com/articles/dataset/Systematic_Approaches_for_the_Encoding_of_Chemical_Groups_A_Case_Study/25429694
Description:
Regulatory authorities aim to organize substances into groups to
facilitate prioritization within hazard and risk assessment processes.
Often, such chemical groupings are not explicitly defined by structural
rules or physicochemical property information. This is largely because
the groupings are developed through manual expert curation, which in
turn makes updating and refining them as new substances are evaluated
a practical challenge. Herein, machine learning methods were leveraged
to build models that could preliminarily assign substances to predefined
groups. A set of 86 groupings containing 2,184 substances, as published
on the European Chemicals Agency (ECHA) website, was mapped to the U.S.
Environmental Protection Agency (EPA) Distributed Toxicity Structure
Database (DSSTox) content to extract chemical and structural information.
Substances were represented using Morgan fingerprints, and two machine
learning approaches, k-nearest neighbor (kNN) and random forest (RF),
were used to classify test substances into the 56 groups containing at
least 10 substances with a structural representation in the data set.
These approaches yielded mean 5-fold cross-validation test accuracies
(average F1 scores) of 0.781 and 0.853, respectively. With a relative
improvement of about 9%, the RF classifier was significantly more
accurate than kNN (p-value = 0.001). The approach offers promise as a
means of initially profiling new substances into predefined groups to
facilitate prioritization efforts and streamline the assessment of new
substances when earlier groupings are available. The algorithm to fit
and use these models has been made available in the accompanying
repository, thereby enabling both use of the produced models and
refitting of these models as new groupings become available from
regulatory authorities or industry.
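
The kNN assignment step described above can be illustrated with a minimal sketch. It is not the code from the accompanying repository. The description names Morgan fingerprints and kNN but does not state the similarity metric, the value of k, or the library used, so the Tanimoto similarity, k = 3, the function names, and the toy data below are all assumptions made for illustration. Each fingerprint is represented as a set of on-bit indices; in practice these would be generated by a cheminformatics toolkit from the DSSTox structures.

```python
from collections import Counter


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def knn_predict(query: frozenset, training: list, k: int = 3) -> str:
    """Assign `query` to the most common group among its k most similar
    training substances (similarity metric and k are assumptions)."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    votes = Counter(group for _, group in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (on-bit indices of a fingerprint, group label).
# Real inputs would be Morgan fingerprints of substances in ECHA groupings.
TRAINING = [
    (frozenset({1, 2, 3, 4}), "group_A"),
    (frozenset({1, 2, 3, 5}), "group_A"),
    (frozenset({2, 3, 4, 6}), "group_A"),
    (frozenset({10, 11, 12}), "group_B"),
    (frozenset({10, 11, 13}), "group_B"),
    (frozenset({11, 12, 14}), "group_B"),
]

# Preliminary group assignment for two unseen (hypothetical) substances.
pred_a = knn_predict(frozenset({1, 2, 4, 5}), TRAINING, k=3)
pred_b = knn_predict(frozenset({10, 12, 14}), TRAINING, k=3)
print(pred_a, pred_b)
```

A random forest classifier would take the same fingerprint-and-label pairs as input; only the model that maps fingerprints to groups changes.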
Created:
2024-03-18



