A Knowledge-Guided Graph Representation Learning Approach for Omics-based Therapeutic Molecule Discovery
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/8394246
Resource description:
# KGDRP: Interpretable drug response prediction with biomedical networks accelerates phenotypic drug discovery and drug target prioritization
We present a multimodal data fusion framework called Knowledge-Guided Drug Relational Predictor (KGDRP). This framework seamlessly integrates multiple omics data, including network data containing biological system information, gene expression data coupled with phenotypic information, and sequence data incorporating chemical molecular structures, within a heterogeneous graph structure.
## System Requirements
The source code was developed in Python 3.6.13 using PyTorch 1.10.1. The required Python dependencies are listed below. KGDRP runs on any standard computer and operating system with enough RAM. There are no additional non-standard hardware requirements.
```text
torch==1.10.1
dgl==0.8.2
numpy==1.19.5
scikit-learn==0.24.2
pandas==1.1.5
tqdm==4.64.0
scipy==1.5.4
optuna==2.10.1
```
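To check an environment against the pinned versions above, a small helper can compare the list with the versions actually installed. This is a minimal sketch and not part of the KGDRP code: `parse_pins` and `find_mismatches` are hypothetical names, and the installed versions are passed in as a plain dictionary so the check does not depend on any particular packaging API.

```python
def parse_pins(text):
    """Parse 'name = version' or 'name==version' lines into a dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, version = line.partition("=")
        # Strip any remaining '=' and spaces so both pin styles are accepted.
        pins[name.strip()] = version.strip("= ")
    return pins


def find_mismatches(pins, installed):
    """Return {name: (expected, found)} for missing or differing versions."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pins.items()
        if installed.get(name) != expected
    }


pins = parse_pins("torch = 1.10.1\ndgl==0.8.2\n")
# Hypothetical environment in which dgl is newer than the pinned version.
mismatches = find_mismatches(pins, {"torch": "1.10.1", "dgl": "0.9.0"})
print(mismatches)
```

On Python 3.6, the installed version of a package can be read with `pkg_resources.get_distribution(name).version`; on Python 3.8 and later, `importlib.metadata.version(name)` does the same.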
- **Description**: This dataset contains the knowledge graph data that represents relationships between biological entities, including drugs, proteins, and diseases. It
## Instructions for Use
The knowledge graph and drug response data used in this work are provided in the ./data/ folder.
1. **Knowledge graph**
- **File**: ```dppc_kg.csv```
- **Description**: BioHG centers on the interactions among drugs, proteins, and cell lines, with node and edge selection guided by task relevance and data availability. Its architecture therefore incorporates three key data types: drug response data capturing drug-cell line relationships, drug-target data describing drug-protein interactions, and RNA expression profiles of cell lines representing protein-cell line relationships. However, inferring the biological function of proteins, which is key to decoding drug mechanisms, is challenging when relying solely on these three relational data sources. Taking advantage of abundant omics data, BioHG incorporates protein-protein interactions (PPI) from the UniProt database to enrich protein functional representations and enhance network connectivity. To characterize protein function in more detail, BioHG integrates Gene Ontology (GO) data from the UniProt database and pathway data from the Reactome database, which add foundational and higher-order biological information, respectively.
- **Statistics**
| Node Type | Node Number | Edge Type | Edge Number | Data Source |
|---------------|-------------|----------------------------------|-------------|-------------|
| Drug | 7070 | Drug-Target Interaction | 28033 | [DrugBank](https://go.drugbank.com) |
| Protein | 117841 | Protein-Protein Interaction | 234454 | [BioKG](https://github.com/dsi-bdi/biokg) |
| Pathway | 21178 | Pathway-Protein Association | 807539 | [Reactome](https://reactome.org/download-data) |
| Gene Ontology | 804 | Protein-Gene Ontology Association | 891800 | [UniProt](https://www.uniprot.org) |
| Cell Line | 117841 | Protein-Cell Line Association | 392554 | [GDSC](https://www.cancerrxgene.org) |
- **Note**: For pathway data, we retain only human pathways. For Gene Ontology data, we searched UniProt for the proteins involved in Drug-Target Interaction and Protein-Protein Interaction edges and exported the corresponding Gene Ontology annotations.
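A quick way to sanity-check the knowledge graph file against the statistics above is to count edges per relation type. The sketch below assumes the file stores one triple per row with columns named `head`, `relation`, and `tail`; these column names and the relation labels in the toy data are assumptions, so adjust them to the actual header of ```dppc_kg.csv```.

```python
import csv
import io
from collections import Counter


def count_relations(csv_file, relation_col="relation"):
    """Count how many triples each relation type has."""
    reader = csv.DictReader(csv_file)
    return Counter(row[relation_col] for row in reader)


# Toy stand-in for dppc_kg.csv; real column names and labels may differ.
sample = io.StringIO(
    "head,relation,tail\n"
    "drug_1,DTI,prot_1\n"
    "drug_2,DTI,prot_1\n"
    "prot_1,PPI,prot_2\n"
)
counts = count_relations(sample)
for relation, n_edges in counts.most_common():
    print(relation, n_edges)
```

For the real file, pass `open("./data/dppc_kg.csv", newline="")` instead of the in-memory sample.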
2. **CV folds of drug response data**
- ```./data/cv_mix/```
- ```./data/cv_cell/```
- ```./data/cv_drug/```
- ```./data/cv_both/```
3. **Drug ID map file**
- **File**: ```cid_infor.csv```
- **Description**: This file maps between drug identifiers. Its unique drug identifiers (IDs) can be used to link each drug to its molecular fingerprint in ```bdki_db_gdsc_fp.csv```.
  - The column ```drug_name``` contains the original identifier used in the drug response data.
  - The column ```smiles``` contains the SMILES string representing the drug’s chemical structure.
  - The column ```stand_smiles``` contains the standardized SMILES string produced by [RDKit](https://www.rdkit.org/docs/index.html) for alignment.
  - The column ```cid``` contains the CID identifier aligned with the [PubChem Database](https://pubchem.ncbi.nlm.nih.gov).
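The ID map can be turned into a lookup table from response-data names to PubChem CIDs. The following is a minimal sketch using the four column names described above; the two rows are toy values for illustration only, and `load_drug_id_map` is a hypothetical helper name.

```python
import csv
import io


def load_drug_id_map(csv_file):
    """Map each drug_name to its PubChem CID (kept as a string)."""
    return {row["drug_name"]: row["cid"] for row in csv.DictReader(csv_file)}


# Toy stand-in for cid_infor.csv (ethanol and benzene, not real file content).
sample = io.StringIO(
    "drug_name,smiles,stand_smiles,cid\n"
    "drug_A,OCC,CCO,702\n"
    "drug_B,C1=CC=CC=C1,c1ccccc1,241\n"
)
id_map = load_drug_id_map(sample)
print(id_map["drug_A"])
```

Keeping the CID as a string avoids accidental type mismatches when it is later used as a join key against other files.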
4. **Features of cell lines**
- **File**: ```rna_input.csv```
- **Description**: This file provides the node feature representation for each cell line. Each row represents a cell line, and each column represents a gene. The file includes identifiers for both the cell lines and the genes.
5. **Gene-cell line edges for KG construction**
- **File**: ```rna_triples.csv```
- **Description**: This file consists of expression triples that relate genes to their expression levels in cell lines. It captures the expression profiles of the samples and helps in understanding their role in disease pathology.
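To illustrate how a wide expression matrix relates to a triple representation, the sketch below reshapes a cell line by gene table into (cell line, gene, value) rows. It assumes the first column holds the cell line identifier and the header row holds gene identifiers. How ```rna_triples.csv``` was actually derived (for example, any thresholding or discretization) is not stated here, so this shows only the generic reshaping step.

```python
import csv
import io


def matrix_to_triples(csv_file):
    """Reshape a wide cell-line x gene table into (cell_line, gene, value) triples."""
    reader = csv.reader(csv_file)
    header = next(reader)
    genes = header[1:]  # first header cell labels the cell line column
    triples = []
    for row in reader:
        cell_line, values = row[0], row[1:]
        for gene, value in zip(genes, values):
            triples.append((cell_line, gene, float(value)))
    return triples


# Toy stand-in for rna_input.csv; identifiers and values are made up.
sample = io.StringIO(
    "cell_line,GENE_A,GENE_B\n"
    "cell_1,2.5,0.0\n"
    "cell_2,1.0,3.2\n"
)
triples = matrix_to_triples(sample)
print(triples[0])
```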
6. **Features of drugs**
- **File**: ```bdki_db_gdsc_fp.csv```
- **Description**: This file contains the molecular fingerprints for a set of compounds, represented as 1024-bit Morgan fingerprints generated with RDKit. Each row corresponds to a compound, and each column represents one bit of the 1024-bit fingerprint.
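Because each fingerprint row is a fixed-length bit vector, two compounds can be compared directly, for example with Tanimoto similarity. This is an illustrative sketch rather than a step of the KGDRP pipeline; it assumes each row has already been parsed into a list of 0/1 integers, and the two fingerprints are synthetic.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length bit vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


def bits_from_indices(on_bits, n_bits=1024):
    """Build a 0/1 list of length n_bits with the given positions set."""
    fp = [0] * n_bits
    for idx in on_bits:
        fp[idx] = 1
    return fp


# Synthetic 1024-bit fingerprints sharing two of five distinct on-bits.
fp_a = bits_from_indices({1, 5, 9})
fp_b = bits_from_indices({5, 9, 100, 200})
similarity = tanimoto(fp_a, fp_b)
print(similarity)
```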
7. **Negative samples of protein-pathway**
- **File**: ```pro_path_neg_sp.csv```
- **Description**: This dataset provides negative relationships between pathways and proteins. The negative triples are sampled from BioHG using the shortest path between each protein and its corresponding pathway.
8. **Negative samples of drug-protein**
- **File**: ```neg_dpi_df_t10.csv```
- **Description**: This dataset provides negative interactions between drugs and proteins. The negative triples are sampled from BioHG using the shortest path between each protein and its corresponding drug.
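The exact rule used for shortest-path negative sampling is not spelled out above. One plausible reading, shown here purely as an assumption, is that a candidate counts as a negative when it lies at least `min_dist` hops away from the source node (or is unreachable). The function names, the threshold, and the toy graph are all hypothetical.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distance from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def distant_negatives(adjacency, source, candidates, min_dist=3):
    """Candidates at least min_dist hops from source; unreachable ones qualify."""
    dist = bfs_distances(adjacency, source)
    return sorted(c for c in candidates if dist.get(c, float("inf")) >= min_dist)


# Toy undirected graph: proteins P1..P3 and pathways PW1..PW3.
adjacency = {
    "P1": ["PW1", "P2"],
    "PW1": ["P1"],
    "P2": ["P1", "PW2", "P3"],
    "PW2": ["P2"],
    "P3": ["P2", "PW3"],
    "PW3": ["P3"],
}
negatives = distant_negatives(adjacency, "P1", ["PW1", "PW2", "PW3"])
print(negatives)
```

In this toy graph, PW1 is directly linked to P1 and PW2 is two hops away, so only PW3 (three hops) passes the assumed threshold.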
### Hyper-parameter
The hyper-parameter script is located in the ./src/ folder.
**Note:** To enable heterogeneous message passing, replace the hetero.py file located at ./anaconda3/envs/env_name/lib/python3.6/site-packages/dgl/nn/pytorch/hetero.py with the version provided in the ./src/ folder.
The script is compatible with both CPU and GPU architectures. A single run of the script typically requires approximately 30 minutes to complete.
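Since replacing hetero.py modifies an installed package, it can be safer to keep a backup of the original file. The sketch below is one way to do that with the standard library. `replace_with_backup` is a hypothetical helper, and the demonstration runs on temporary files rather than a real DGL installation.

```python
import os
import shutil
import tempfile


def replace_with_backup(patched_path, target_path):
    """Copy patched_path over target_path, keeping the original as '<target>.bak'."""
    backup_path = target_path + ".bak"
    if not os.path.exists(backup_path):
        # Only back up once so a second run does not overwrite the original copy.
        shutil.copy2(target_path, backup_path)
    shutil.copy2(patched_path, target_path)
    return backup_path


# Demonstration on throwaway files standing in for the two hetero.py versions.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "hetero.py")
    patched = os.path.join(tmp, "hetero_patched.py")
    with open(target, "w") as handle:
        handle.write("# original\n")
    with open(patched, "w") as handle:
        handle.write("# patched\n")
    backup = replace_with_backup(patched, target)
    with open(target) as handle:
        new_text = handle.read()
    with open(backup) as handle:
        old_text = handle.read()

print(new_text.strip(), "|", old_text.strip())
```

In a real environment, the target path can be built from the installed package location, for example `os.path.join(os.path.dirname(dgl.__file__), "nn", "pytorch", "hetero.py")`, which avoids hard-coding the Anaconda environment path.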
## License
[MIT](https://choosealicense.com/licenses/mit/)
Created: 2025-01-20



