SaintGSE: Transformer-based efficient and explainable gene set enrichment analysis
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/14219499
SaintGSE is an artificial intelligence model designed to predict human gene-pathway relationships using large-scale differentially expressed gene (DEG) datasets. By combining an autoencoder with the SAINT transformer model, SaintGSE addresses challenges in gene expression analysis such as data scarcity, model compatibility, and interpretability. This project adapts code from the SAINT project (https://github.com/somepago/saint), which is licensed under the Apache License 2.0.
Key Features
* AI-Driven Pathway Prediction: Uses autoencoders and the SAINT model to analyze gene expression data and predict related signaling pathways.
* Osteoarthritis Study: Applied to osteoarthritis (OA) to identify key pathways and potential therapeutic targets.
* Explainability: Utilizes Shapley additive explanations (SHAP) to interpret model predictions and identify influential genes.
Installation
Before installation, we recommend creating a conda environment from the attached yml file and activating it.
Our code has been tested with Python 3.8 on Linux.
```
$ cd /path/to/SaintGSE
$ conda env create -f saintgse_env.yml
$ conda activate saintgse_env
```
After downloading all the files, extract every compressed directory by running the following commands in your terminal:
```
$ find . -name "*.tar.gz" -exec tar -xzvf {} \;
$ rm *.tar.gz
```
Setup is complete once the directory structure looks like this:
```
.
├── datasets
│   ├── AE_100cycle_model.pth
│   ├── AE_enrichment.tsv
│   ├── AEshap
│   │   ├── shap_values_latent_dim_1.csv
│   │   ├── shap_values_latent_dim_2.csv
│   │   ├── shap_values_latent_dim_3.csv
│   │   ├── ...
│   │   ├── ...
│   │   └── shap_values_latent_dim_256.csv
│   ├── bestmodels
│   │   └── binary
│   │       ├── Chronic_Myeloid_Leukemia
│   │       │   └── testrun
│   │       │       └── saint_gse_model.pth
│   │       ├── ...
│   │       └── Selencompound_Biosynthesis
│   │           └── testrun
│   │               └── saint_gse_model.pth
│   ├── gene_list.pkl
│   ├── MGI_Gene_Model_Coord.tsv
│   └── pathway_list_in_DEG.txt
```
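The layout can be checked before running anything else. The following is a minimal Python sketch, not part of SaintGSE: the function name `missing_paths` is hypothetical, and the list covers only the entries named in the tree above (elided entries marked `...` are not checked).

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the SaintGSE root.
EXPECTED = [
    "datasets/AE_100cycle_model.pth",
    "datasets/AE_enrichment.tsv",
    "datasets/AEshap",
    "datasets/bestmodels/binary",
    "datasets/gene_list.pkl",
    "datasets/MGI_Gene_Model_Coord.tsv",
    "datasets/pathway_list_in_DEG.txt",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under root (hypothetical helper)."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected files and directories are present.")
```

Run it from the SaintGSE root directory. An empty result means the archives were extracted into the expected places.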
The code in this dataset is also available on GitHub:
https://github.com/MSjeon27/SaintGSE
DEG dataset preparation
Before running SaintGSE, prepare the input DEG data as a tab-separated (.tsv) file. Each column header is the official gene symbol of a DEG, and each row holds the log2 fold change values for one DEG group. For example:
```
                        LAP3    CD99    HS3ST1  MAD1L1  LASP1   SNX11
'mock-6' vs 'LPS-6'     -1.3    0       2.4     0       0.7     0
'mock-6' vs 'EBOV-6'    -1.3    0       2.3     0       0.6     0
```
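A file in this shape can be written with Python's standard `csv` module. This is a minimal sketch rather than SaintGSE code: the function name `write_deg_tsv` and the output file name are hypothetical, and two details are assumptions not stated above, namely that the header row begins with an empty cell above the group labels and that genes without a value in a group are written as 0.

```python
import csv


def write_deg_tsv(path, genes, groups):
    """Write a DEG matrix: gene symbols as columns, one row per DEG group.

    genes  : list of official gene symbols (column order)
    groups : dict mapping a group label to {gene: log2 fold change}
    """
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        # Assumption: the header row starts with an empty cell above the labels.
        writer.writerow([""] + genes)
        for label, fold_changes in groups.items():
            # Assumption: genes without a value in this group are written as 0.
            writer.writerow([label] + [fold_changes.get(g, 0) for g in genes])


# Values copied from the example table above.
genes = ["LAP3", "CD99", "HS3ST1", "MAD1L1", "LASP1", "SNX11"]
groups = {
    "'mock-6' vs 'LPS-6'": {"LAP3": -1.3, "HS3ST1": 2.4, "LASP1": 0.7},
    "'mock-6' vs 'EBOV-6'": {"LAP3": -1.3, "HS3ST1": 2.3, "LASP1": 0.6},
}
write_deg_tsv("DEGs.tsv", genes, groups)
```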
Usage
Step 0. Preprocessing the input DEGs (from PyDESeq2 results)
The preprocessing script converts human or mouse DEG data into the input format that SaintGSE expects. For mouse data, it currently also converts mouse genes into human genes.
* human DEGs
```
$ preprocessing.py --query_fc /path/to/your/DEGs.tsv --out Preprocessed_fc.tsv
```
* mouse DEGs
```
$ preprocessing.py --query_fc /path/to/your/DEGs.tsv --org mouse --out Preprocessed_fc.tsv
```
Step 1. Training SaintGSE for a target pathway
Before predicting on new gene expression datasets, train SaintGSE for the target pathway:
```
$ SaintGSE.py --pathway 'Proteins Involved in Osteoarthritis' --pretrain
```
Step 2. Prediction through SaintGSE
```
$ SaintGSE.py --predict Preprocessed_fc.tsv --pathway 'Proteins Involved in Osteoarthritis'
```
The prediction is printed as a tensor, for example:
```
tensor([[1.]], device='cuda:0')
```
A value of 1 indicates that your DEG data is related to the target signaling pathway.
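When the prediction step is scripted, the printed tensor has to be turned back into a number. The sketch below is one way to do that in Python and is not part of SaintGSE. Both helper names are hypothetical. It assumes that `SaintGSE.py` sits in the working directory and is launched with the current interpreter, and that the output always has the single-value form shown above.

```python
import re
import sys


def build_predict_cmd(deg_file, pathway):
    """Assemble the Step 2 command as an argument list for subprocess.run."""
    # Assumption: SaintGSE.py is in the working directory and is run with
    # the current interpreter rather than as an executable on PATH.
    return [sys.executable, "SaintGSE.py", "--predict", deg_file, "--pathway", pathway]


def parse_prediction(stdout):
    """Extract the value from a line such as tensor([[1.]], device='cuda:0').

    Returns None when no such line is found.
    """
    match = re.search(r"tensor\(\[\[\s*([0-9.]+)\s*\]\]", stdout)
    return float(match.group(1)) if match else None


cmd = build_predict_cmd("Preprocessed_fc.tsv", "Proteins Involved in Osteoarthritis")
print(cmd[1:])
print(parse_prediction("tensor([[1.]], device='cuda:0')"))
```

In practice, the text passed to `parse_prediction` would come from `subprocess.run(cmd, capture_output=True, text=True).stdout`.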
Step 3. Interpreting the SaintGSE result (relative SHAP contribution of each DEG)
```
$ Interpret.py -d Preprocessed_fc.tsv -p 'Proteins Involved in Osteoarthritis'
```
Interpretation produces the following file for each sample.
```
__significant_gene_shap.csv
```
This file contains the relative SHAP contribution of each gene in the DEG data. For subsequent analysis, we recommend focusing on the genes whose relative SHAP contributions fall in the top 35% to 50%, depending on the number of DEGs, as suggested in the paper.
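The selection of top-ranked genes can be sketched in Python as follows. This is an illustration, not SaintGSE code. The function name `top_genes` is hypothetical, and the sketch reads "top 35% to 50%" as a share of genes ranked by contribution, which is an assumption. It also assumes that a larger value means a more influential gene, and it rounds the number of genes kept upward. Loading the CSV is left out because its column layout is not described here, so the input is a plain dict.

```python
import math


def top_genes(contributions, fraction=0.35):
    """Return the genes in the top `fraction` by relative SHAP contribution.

    contributions : dict mapping gene symbol -> relative SHAP contribution
    fraction      : share of genes to keep, e.g. 0.35 to 0.50
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Assumption: a larger value means a more influential gene.
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    # Assumption: round the number of genes kept upward.
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


# Made-up values for illustration only, not results from SaintGSE.
example = {
    "LAP3": 0.30, "HS3ST1": 0.25, "LASP1": 0.20, "CD99": 0.12,
    "MAD1L1": 0.06, "SNX11": 0.04, "GENE_G": 0.02, "GENE_H": 0.01,
}
print(top_genes(example, 0.35))
print(top_genes(example, 0.50))
```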
How to Cite
If you use this model or repository in your research, please cite it as follows:
```
Jeon, MS & Nam, JH et al., "SaintGSE: Transformer-based efficient and explainable gene set enrichment analysis," 2024. GitHub repository. Available at: https://github.com/MSjeon27/SaintGSE
```
For more information or any questions regarding citation, feel free to contact us (msjeon27@cau.ac.kr).
Created: 2024-11-26



