Integrated Protein-Ligand Interaction Database
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/records/2649608
Description:
Computational prediction of genome-wide protein-ligand interactions plays a key role in drug discovery, toxicology, and many other applications. Despite recent advances in deep learning, the large quantity of high-quality data required for training and evaluating models has impeded its application in computational drug development. To address this problem, we have developed an Integrated Protein-Ligand Interaction Database (IPLID). IPLID integrates protein-ligand interaction data from multiple well-known resources, including BindingDB, ChEMBL, DrugBank, GPCRDB, PubChem, LINCS-HMS KinomeScan, and four published kinome assay data sets. IPLID provides search functions designed specifically for machine learning projects, particularly deep learning projects. Users can retrieve numerically labeled or binary-labeled (e.g. pKi, pKd, or binary) protein-ligand interaction data for different classes of proteins (e.g. GPCRs, kinases, FDA-approved targets, and protein products of cancer-related genes). To facilitate the development of benchmarks for training and testing machine learning algorithms, it also provides chemical-chemical structure similarity scores calculated with a well-established method: the Tanimoto coefficient (Jaccard similarity) between the Extended Connectivity Fingerprint (ECFP4) representations of two molecules. Protein sequence similarities based on BLAST score comparison, and position-specific scoring matrices computed against the UniRef50 sequence database, are also available for more complicated protein-ligand interaction modeling projects. We believe our database can facilitate machine learning and deep learning-based drug development, as well as other applications, by providing integrated data sets appropriate for many research interests. The database can be used for small-scale (e.g. kinases or GPCRs only) and large-scale (e.g. proteome-wide) projects, whether qualitative or quantitative.
With its ease of use and straightforward data format, IPLID is also a useful educational resource for computer science and data science trainees who lack familiarity with chemistry and biology.
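For readers new to cheminformatics, the Tanimoto coefficient mentioned above can be illustrated with a minimal sketch. It assumes each fingerprint is represented as a set of "on" bit indices; the example bit sets are made up for illustration and are not taken from IPLID. In practice, ECFP4-style fingerprints would be generated with a cheminformatics toolkit (for example, Morgan fingerprints of radius 2), and the exact settings used to build IPLID are not stated here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumption: two empty fingerprints are given similarity 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints (illustrative bit indices only).
fp_x = {1, 5, 9, 12}
fp_y = {1, 5, 7, 12, 20}

# 3 shared bits out of 6 distinct bits in total.
similarity = tanimoto(fp_x, fp_y)
print(similarity)
```

A score of 1.0 means the two fingerprints have identical on-bits, and 0.0 means they share none.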
Activity data are provided as tab-delimited text files (.tsv).
Binary activities are under the 'binary_activity' directory, and numerical activities are under the 'numerical_activity' directory.
File names follow the pattern "(targets)_(activity_type).tsv".
Long target names are abbreviated; the abbreviations are listed below.
Ligand-ligand similarity scores are under the 'ligand_info' directory.
Protein-protein similarity scores and position-specific scoring matrices are under the 'protein_info' directory.
The primary ligand ID and protein ID are the InChIKey and the UniProt ID, respectively.
*Abbreviations: CYP450 (Cytochrome P450), CRT (Cancer-Related Target), CDT (Cardiovascular Disease candidate Target), DRT (Disease-Related Target), FDA (FDA-approved target), GPCR (G-Protein Coupled Receptor), NR (Nuclear Receptor), PDT (Potential Drug Target), TF (Transcription Factor).
*These protein classifications are from the UniProt database and the Human Protein Atlas (https://www.proteinatlas.org/).
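The file layout described above could be handled along the following lines. This is a sketch under stated assumptions, not the authors' tooling: the example path "numerical_activity/GPCR_pki.tsv", the column names ("inchikey", "uniprot_id", "pki"), and the placeholder row are hypothetical, and the code assumes the activity type is the part of the file name after the last underscore. Check the actual files in the Zenodo record before relying on any of these.

```python
import csv
import io
from pathlib import PurePosixPath

# Abbreviations as listed in the dataset description.
ABBREVIATIONS = {
    "CYP450": "Cytochrome P450",
    "CRT": "Cancer-Related Target",
    "CDT": "Cardiovascular Disease candidate Target",
    "DRT": "Disease-Related Target",
    "FDA": "FDA-approved target",
    "GPCR": "G-Protein Coupled Receptor",
    "NR": "Nuclear Receptor",
    "PDT": "Potential Drug Target",
    "TF": "Transcription Factor",
}


def parse_activity_filename(path):
    """Split "(targets)_(activity_type).tsv" into (target name, activity type).

    Assumption: the activity type follows the LAST underscore in the name.
    """
    stem = PurePosixPath(path).stem
    targets, sep, activity_type = stem.rpartition("_")
    if not sep or not targets or not activity_type:
        raise ValueError(f"unexpected file name: {path!r}")
    # Expand a known abbreviation; otherwise keep the name as written.
    return ABBREVIATIONS.get(targets, targets), activity_type


def read_tsv(handle):
    """Read a tab-delimited file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical file name and in-memory content (placeholder row, not real data).
target, activity = parse_activity_filename("numerical_activity/GPCR_pki.tsv")
sample = io.StringIO("inchikey\tuniprot_id\tpki\nEXAMPLE-INCHIKEY\tP00000\t7.2\n")
rows = read_tsv(sample)
print(target, activity, rows[0]["uniprot_id"])
```

For a real file, the same `read_tsv` function can be passed a handle from `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.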
Created:
2024-07-24



