ChemBioSim Recalibration: Twelve Preprocessed ChEMBL Data Sets
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/5167635
Resource description:
Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data
Project description
Machine learning models are powerful tools for the prediction of molecular properties or the biological activity of chemical compounds. However, to make these models useful and applicable, the confidence in the predictions should also be specified. For that purpose, models may be integrated in a conformal prediction (CP) framework that adds a calibration step to estimate the confidence of the predictions. CP models offer the advantage of ensuring a predefined error rate, as long as the test and training sets are exchangeable.
In cases where the test data presents a drift from the descriptor space of the training data, or where assay setups change, this assumption may not be fulfilled and the models are not guaranteed to be valid.
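The calibration step described above can be illustrated with a minimal sketch of class-conditional (Mondrian) inductive conformal prediction. This is an assumption-laden illustration, not the manuscript's implementation: the function names are hypothetical, and the nonconformity score is assumed to be something like one minus the predicted probability of the class, so that larger scores mean "stranger" compounds.

```python
def conformal_p_value(calibration_scores, test_score):
    """p-value of a test compound for one class.

    Counts calibration compounds of that class that are at least as
    nonconforming as the test compound; the +1 terms count the test
    compound itself.
    """
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(cal_scores_by_class, test_scores_by_class, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return {
        label
        for label, test_score in test_scores_by_class.items()
        if conformal_p_value(cal_scores_by_class[label], test_score) > significance
    }


# Hypothetical nonconformity scores, for illustration only.
calibration = {0: [0.1, 0.2, 0.3, 0.4], 1: [0.2, 0.5, 0.6, 0.9]}
test_compound = {0: 0.35, 1: 0.95}
print(prediction_set(calibration, test_compound, significance=0.25))
```

If calibration and test compounds are exchangeable, the true label is excluded from the prediction set at most a fraction `significance` of the time. When the test data drift, the calibration scores no longer represent the test scores and that guarantee is lost.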
In this study, the performance of internally valid CP models was evaluated upon application to either newer time-split data or to external data. More specifically, temporal data drifts were analysed based on time-splits of twelve toxicity-related datasets from the ChEMBL database. Moreover, models trained on publicly available data for liver toxicity and the in vivo micronucleus test (MNT) were applied to proprietary data to evaluate the discrepancies. In general, it was observed that the training and (holdout) test sets were not exchangeable in the studied set-ups, and the models were therefore not applicable (i.e. non-valid CP models).
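A time-split of the kind mentioned above can be sketched as a partition of the records by publication year. The record layout, the `year` key, and the threshold years below are hypothetical and chosen for illustration; the actual split points used in the manuscript are not stated here.

```python
def split_by_year(records, year_thresholds, year_key="year"):
    """Partition records into chronological subsets.

    A record lands in subset i when its year exceeds exactly i of the
    thresholds, so N thresholds yield N + 1 subsets, oldest first.
    """
    subsets = [[] for _ in range(len(year_thresholds) + 1)]
    for record in records:
        index = sum(1 for t in year_thresholds if record[year_key] > t)
        subsets[index].append(record)
    return subsets


# Hypothetical records; only the publication year matters for the split.
records = [
    {"id": "A", "year": 2005},
    {"id": "B", "year": 2010},
    {"id": "C", "year": 2012},
    {"id": "D", "year": 2015},
    {"id": "E", "year": 2018},
]
train, update, holdout = split_by_year(records, year_thresholds=[2010, 2014])
print([r["id"] for r in train], [r["id"] for r in update], [r["id"] for r in holdout])
```

Splitting on year thresholds rather than on fixed fractions keeps all compounds from one publication year in the same subset, which avoids leaking same-year data between the older and newer sets.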
To recover the validity of the models on the holdout test set, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Restored validity is the main requisite for applying the CP models with confidence. However, this comes at the cost of decreased model efficiency, as more predictions are identified as inconclusive.
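The trade-off between validity and efficiency can be made concrete with a small sketch. It assumes the common definitions for binary CP: validity is the fraction of prediction sets that contain the true label, and efficiency is the fraction of prediction sets with exactly one label (empty and two-label sets count as inconclusive). The function names and example data are hypothetical.

```python
def validity(prediction_sets, true_labels):
    """Fraction of prediction sets that contain the true label."""
    hits = sum(1 for s, y in zip(prediction_sets, true_labels) if y in s)
    return hits / len(true_labels)


def efficiency(prediction_sets):
    """Fraction of prediction sets that contain exactly one label."""
    singles = sum(1 for s in prediction_sets if len(s) == 1)
    return singles / len(prediction_sets)


# Hypothetical prediction sets for four compounds and their true labels.
sets = [{1}, {0, 1}, {0}, set()]
labels = [1, 0, 1, 0]
print(validity(sets, labels), efficiency(sets))
```

A calibration set that better matches the holdout data tends to produce larger p-values for unfamiliar compounds, so more prediction sets contain both labels: validity rises towards the target level while efficiency drops.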
Dataset
The uploaded file contains the ChEMBL data used in the work for the manuscript “Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data”.
Twelve preprocessed datasets are available for the following ChEMBL endpoints, extracted from ChEMBL version 26. Each dataset contains the molecule ChEMBL ID, SMILES, binary activity (1 if active, 0 if inactive), publication year, and CHEMBIO descriptors:
CHEMBL220: Acetylcholinesterase (human), 2673 compounds
CHEMBL4078: Acetylcholinesterase (fish), 3811 compounds
CHEMBL5763: Cholinesterase, 2755 compounds
CHEMBL203: EGFR erbB1, 4059 compounds
CHEMBL206: Estrogen receptor alpha, 1416 compounds
CHEMBL279: VEGFR 2, 5174 compounds
CHEMBL230: Cyclooxygenase-2, 2020 compounds
CHEMBL340: Cytochrome P450 3A4, 3316 compounds
CHEMBL240: HERG, 4976 compounds
CHEMBL2039: Monoamine oxidase B, 2534 compounds
CHEMBL222: Norepinephrine transporter, 1566 compounds
CHEMBL228: Serotonin transporter, 2111 compounds
Usage
This dataset can be used as input to run the notebooks available at
https://github.com/volkamerlab/CPRecalibration_manuscript_SI
1. Clone the GitHub repository.
2. Download the dataset provided here.
3. Copy the dataset (don’t extract) into the data folder of the cloned GitHub repository.
4. Follow the instructions on GitHub.
Created:
2021-08-29



