ChemBioSim Recalibration: Twelve Preprocessed ChEMBL Data Sets

Source: NIAID Data Ecosystem, indexed 2026-03-12

Download link:

https://zenodo.org/record/5167635


Resource description:

Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data

Project description

Machine learning models are powerful tools for predicting molecular properties or the biological activity of chemical compounds. To make these models useful and applicable, however, the confidence in each prediction should also be specified. For that purpose, models may be integrated into a conformal prediction (CP) framework, which adds a calibration step to estimate the confidence of the predictions. CP models offer the advantage of ensuring a predefined error rate, as long as the test and training sets are exchangeable. When the test data drift away from the descriptor space of the training data, or when assay setups change, this assumption may not hold and the models are no longer guaranteed to be valid.

In this study, the performance of internally valid CP models was evaluated upon application to either newer time-split data or external data. More specifically, temporal data drifts were analysed based on time-splits of twelve toxicity-related datasets from the ChEMBL database. In addition, models trained on publicly available data for liver toxicity and MNT in vivo (the in vivo micronucleus test) were applied to proprietary data to evaluate the discrepancies.

In general, the training and (holdout) test sets were found not to be exchangeable in the studied set-ups, so the resulting CP models were not valid and therefore not applicable. To recover the validity of the models on the holdout test set, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Restored validity is the main requisite for applying the CP models with confidence. However, it comes at the cost of decreased model efficiency, as more predictions are identified as inconclusive.
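The calibration step and the recalibration strategy described above can be illustrated with a minimal sketch of a class-conditional (Mondrian) inductive conformal classifier. This is not the code from the manuscript: the function names, the nonconformity measure (1 minus the predicted class probability), and all numbers are assumptions chosen only to show the mechanism.

```python
def p_value(cal_scores, test_score):
    """Conformal p-value: share of calibration scores at least as
    nonconforming as the test score, with the usual +1 correction."""
    n_at_least = sum(1 for s in cal_scores if s >= test_score)
    return (n_at_least + 1) / (len(cal_scores) + 1)


def prediction_set(cal_scores_by_class, test_scores_by_class, significance):
    """Return the sorted labels whose p-value exceeds the significance level."""
    return sorted(
        label
        for label, score in test_scores_by_class.items()
        if p_value(cal_scores_by_class[label], score) > significance
    )


# Toy nonconformity scores (assumed, not taken from the dataset).
# Nonconformity here is 1 - predicted probability of the class.
original_cal = {0: [0.1, 0.2, 0.3, 0.4], 1: [0.2, 0.3, 0.5, 0.6, 0.7]}

# Hypothetical updated calibration set built from newer data, on which
# the model fits class 0 less well (larger nonconformity scores).
updated_cal = {0: [0.3, 0.6, 0.8, 0.95], 1: [0.2, 0.3, 0.5, 0.6, 0.7]}

# One test compound with predicted probability 0.9 for class 1.
test_scores = {0: 0.9, 1: 0.1}

single = prediction_set(original_cal, test_scores, significance=0.25)
widened = prediction_set(updated_cal, test_scores, significance=0.25)
print(single, widened)
```

With the original calibration scores the compound receives a single-label prediction, while the updated calibration scores yield a prediction set containing both labels. This mirrors the trade-off stated above: recalibration can restore validity on drifted data, but fewer predictions remain conclusive.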
Dataset

The uploaded file contains the ChEMBL data used in the work for the manuscript “Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data”. Twelve preprocessed datasets containing molecule ChEMBL ID, SMILES, binary activity (i.e. 1 if active, 0 if inactive), publication year, and CHEMBIO descriptors are available for the following ChEMBL endpoints, extracted from ChEMBL Version 26:

CHEMBL220: Acetylcholinesterase (human), 2673 compounds
CHEMBL4078: Acetylcholinesterase (fish), 3811 compounds
CHEMBL5763: Cholinesterase, 2755 compounds
CHEMBL203: EGFR erbB1, 4059 compounds
CHEMBL206: Estrogen receptor alpha, 1416 compounds
CHEMBL279: VEGFR 2, 5174 compounds
CHEMBL230: Cyclooxygenase-2, 2020 compounds
CHEMBL340: Cytochrome P450 3A4, 3316 compounds
CHEMBL240: HERG, 4976 compounds
CHEMBL2039: Monoamine oxidase B, 2534 compounds
CHEMBL222: Norepinephrine transporter, 1566 compounds
CHEMBL228: Serotonin transporter, 2111 compounds

Usage

This dataset can be used as input to run the notebooks available at https://github.com/volkamerlab/CPRecalibration_manuscript_SI

1. Clone the GitHub repository.
2. Download the dataset provided here.
3. Copy the dataset (don’t extract) into the data folder of the cloned GitHub repository.
4. Follow the instructions on GitHub.
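Because each record carries a publication year, a time-split of the kind mentioned in the project description can be sketched as follows. The field names, the two year thresholds, and the example records are hypothetical placeholders. The actual file layout and the split used in the manuscript are defined by the notebooks in the GitHub repository, not by this sketch.

```python
def time_split(records, train_until, update_until, year_key="year"):
    """Split records into older training data, intermediate 'update' data
    (e.g. for recalibration), and the newest holdout data, by year."""
    train = [r for r in records if r[year_key] <= train_until]
    update = [r for r in records if train_until < r[year_key] <= update_until]
    holdout = [r for r in records if r[year_key] > update_until]
    return train, update, holdout


# Made-up records for illustration only; IDs and values are placeholders,
# and the key names are assumptions about how the columns might be called.
records = [
    {"id": "EXAMPLE_1", "smiles": "CCO", "activity": 0, "year": 2006},
    {"id": "EXAMPLE_2", "smiles": "CCN", "activity": 1, "year": 2010},
    {"id": "EXAMPLE_3", "smiles": "CCC", "activity": 0, "year": 2013},
    {"id": "EXAMPLE_4", "smiles": "CCCl", "activity": 1, "year": 2016},
    {"id": "EXAMPLE_5", "smiles": "CCBr", "activity": 1, "year": 2018},
]

# Hypothetical thresholds: train up to 2010, update up to 2014, rest holdout.
train, update, holdout = time_split(records, train_until=2010, update_until=2014)
print(len(train), len(update), len(holdout))
```

Splitting on year thresholds rather than on fixed fractions keeps all compounds published in the same year in the same subset, which seems the more natural choice when the goal is to study temporal drift.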

Created:

2021-08-29
