A consensus compound/bioactivity dataset for data-driven drug design and chemogenomics
Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/6320760
Description:
This is the updated version of the dataset from 10.5281/zenodo.6320761
Information
The diverse publicly available compound/bioactivity databases constitute a key resource for data-driven applications in chemogenomics and drug design. However, an analysis of their coverage of compound entries and biological targets revealed considerable differences, suggesting that a consensus dataset would be beneficial. We have therefore combined and curated information from five esteemed databases (ChEMBL, PubChem, BindingDB, IUPHAR/BPS and Probes&Drugs) to assemble a consensus compound/bioactivity dataset comprising 1144648 compounds with 10915362 bioactivities on 5613 targets (including defined macromolecular targets as well as cell lines and phenotypic readouts). The dataset also provides simplified information on the assay types underlying the bioactivity data and on bioactivity confidence, obtained by comparing data from different sources. We have unified the source databases, brought them into a common format and combined them, so that the data can be used easily and generically in applications such as chemogenomics and data-driven drug design.
The consensus dataset provides increased target coverage and contains a higher number of molecules than the source databases, which is also evident from a larger number of scaffolds. These features render the consensus dataset a valuable tool for machine learning and other data-driven applications in (de novo) drug design and bioactivity prediction. The increased chemical and bioactivity coverage of the consensus dataset may improve the robustness of such models compared to models built on the single source databases. In addition, semi-automated structure and bioactivity annotation checks, with flags for divergent data from different sources, may help data selection and further accurate curation.
This dataset belongs to the publication: https://doi.org/10.3390/molecules27082513
Structure and content of the dataset
Dataset structure
ChEMBL ID | PubChem ID | IUPHAR ID | Target | Activity type | Assay type | Unit | Mean C (0) | ... | Mean PC (0) | ... | Mean B (0) | ... | Mean I (0) | ... | Mean PD (0) | ... | Activity check annotation | Ligand names | Canonical SMILES C | ... | Structure check (Tanimoto) | Source
The dataset was created using the Konstanz Information Miner (KNIME) (https://www.knime.com/) and was exported as a CSV-file and a compressed CSV-file.
Except for the canonical SMILES columns, all columns have the data type ‘string’; the canonical SMILES columns use the SMILES format. To use the dataset in KNIME, we recommend the File Reader node, which allows the data types of the columns to be adjusted exactly. In addition, only this node can read the compressed format.
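Outside KNIME, the plain CSV export can also be read with general-purpose tools. The following is a minimal Python sketch using only the standard library. It keeps every field as a string, mirroring the ‘string’ column type described above. The comma delimiter, the function name read_consensus_rows and the sample values are assumptions made for illustration and are not taken from the dataset.

```python
import csv
import io


def read_consensus_rows(stream, delimiter=","):
    """Yield each data row as a dict mapping column name to a string value.

    The delimiter is an assumption; adjust it if the export uses another one.
    """
    reader = csv.DictReader(stream, delimiter=delimiter)
    for row in reader:
        yield row


# Tiny in-memory stand-in for the real file; the values are made up.
sample = io.StringIO(
    "ChEMBL ID,Target,Activity type\n"
    "CHEMBL0000,EGFR,pIC50\n"
)
rows = list(read_consensus_rows(sample))
```

For the real file, an open text handle (for example from open(path, newline="")) would take the place of the in-memory sample. The file name and the compression format are not specified in this description, so they are left out of the sketch.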
Column content:
ChEMBL ID, PubChem ID, IUPHAR ID: compound identifiers from the respective databases
Target: biological target of the molecule expressed as the HGNC gene symbol
Activity type: for example, pIC50
Assay type: Simplification/Classification of the assay into cell-free, cellular, functional and unspecified
Unit: unit of bioactivity measurement
Mean columns of the databases: mean of the bioactivity values, or activity comments, annotated with the frequency of their occurrence in the database, e.g. Mean C = 7.5 *(15) means that the value for this compound-target pair occurs 15 times in the ChEMBL database
Activity check annotation: a bioactivity check was performed by comparing values from the different sources and adding an activity check annotation to provide automated activity validation for additional confidence
no comment: bioactivity values are within one log unit;
check activity data: bioactivity values are not within one log unit;
only one data point: only one value was available, no comparison and no range calculated;
no activity value: no precise numeric activity value was available;
no log-value could be calculated: no negative decadic logarithm could be calculated, e.g., because the reported unit was not a compound concentration
Ligand names: all unique names contained in the five source databases are listed
Canonical SMILES columns: Molecular structure of the compound from each database
Structure check (Tanimoto): indicates whether the compound structures match or differ between the source databases
match: molecule structures are the same between different sources;
no match: the structures differ. We calculated the Jaccard-Tanimoto similarity coefficient from Morgan Fingerprints to reveal true differences between sources and reported the minimum value;
1 structure: no structure comparison is possible, because there was only one structure available;
no structure: no structure comparison is possible, because there was no structure available.
Source: the databases from which the data originate
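As an illustration of the mean columns and the activity check annotation described above, the following Python sketch splits a cell such as 7.5 *(15) into a value and an occurrence count, then compares the values from several sources. It is a sketch under stated assumptions, not the authors' KNIME workflow. The cell layout is inferred from the single example in this description, the one-log-unit boundary is treated as inclusive, the function names and the second sample value are hypothetical, and the "no log-value could be calculated" case is not covered.

```python
import re

# Assumed cell layout, inferred from the example "7.5 *(15)":
# a value or comment, followed by "*(" + occurrence count + ")".
_CELL = re.compile(r"^\s*(?P<value>.*?)\s*\*\((?P<count>\d+)\)\s*$")


def parse_mean_cell(cell):
    """Split a mean-column cell into (value, count).

    value is a float when the text is numeric, otherwise the raw
    activity comment. Returns None for cells that do not match.
    """
    match = _CELL.match(cell)
    if match is None:
        return None
    text = match.group("value")
    try:
        value = float(text)
    except ValueError:
        value = text
    return value, int(match.group("count"))


def activity_check(log_values):
    """Annotate the log-scale values of one compound-target pair."""
    values = [v for v in log_values if v is not None]
    if not values:
        return "no activity value"
    if len(values) == 1:
        return "only one data point"
    # Assumption: a spread of exactly one log unit still counts as "within".
    if max(values) - min(values) <= 1.0:
        return "no comment"
    return "check activity data"


# Made-up cells for two sources of the same compound-target pair.
cells = {"Mean C": "7.5 *(15)", "Mean B": "7.9 *(2)"}
parsed = [parse_mean_cell(text) for text in cells.values()]
numeric = [p[0] for p in parsed if p is not None and isinstance(p[0], float)]
annotation = activity_check(numeric)
```

Cells that hold an activity comment instead of a number are returned as text and are filtered out before the comparison, so a pair with only comments ends up as "no activity value" in this sketch.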
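The structure check can be sketched in a similar way. The example below computes the Jaccard-Tanimoto coefficient on fingerprints represented as sets of on-bit indices and reports the minimum over all pairs of sources. Generating Morgan fingerprints from the SMILES strings requires a cheminformatics toolkit and is assumed to have happened beforehand. Treating a minimum similarity of 1.0 as "match" is also an assumption, since the authors' exact matching criterion is not stated in this description. The function names are hypothetical.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Jaccard-Tanimoto coefficient of two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        # Two empty fingerprints: treated as identical in this sketch.
        return 1.0
    return len(bits_a & bits_b) / union


def structure_check(fingerprints):
    """Return (label, minimum similarity) for one compound.

    fingerprints holds one set of on-bit indices per source database
    that provides a structure for the compound.
    """
    if not fingerprints:
        return "no structure", None
    if len(fingerprints) == 1:
        return "1 structure", None
    minimum = min(tanimoto(a, b) for a, b in combinations(fingerprints, 2))
    if minimum == 1.0:
        return "match", minimum
    return "no match", minimum


# Toy fingerprints with made-up bit indices.
result = structure_check([{1, 2, 3}, {2, 3, 4}])
```

Reporting the minimum rather than the mean means that a single divergent source is enough to flag the compound.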
Created:
2022-05-13



