SaintGSE: Transformer-based efficient and explainable gene set enrichment analysis

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://zenodo.org/record/14219499

Description:

SaintGSE is an artificial intelligence model designed to predict human gene-pathway relationships using large-scale differentially expressed gene (DEG) datasets. By leveraging an autoencoder and the SAINT transformer model, SaintGSE overcomes challenges in gene expression analysis, such as data scarcity, model compatibility, and interpretability.

This project adapts code from the SAINT project (https://github.com/somepago/saint), licensed under the Apache License 2.0.

Key Features

* AI-Driven Pathway Prediction: Uses autoencoders and the SAINT model to analyze gene expression data and predict related signaling pathways.
* Osteoarthritis Study: Applied to osteoarthritis (OA) to identify key pathways and potential therapeutic targets.
* Explainability: Utilizes Shapley additive explanations (SHAP) to interpret model predictions and identify influential genes.

Installation

Before installation, we recommend building a conda environment from the attached yml file and activating it. Our code has been tested with Python 3.8 on Linux.

```
$ cd /path/to/SaintGSE
$ conda env create -f saintgse_env.yml
$ conda activate saintgse_env
```

After downloading all the files, extract the contents of all compressed directories by running the following commands in your terminal:

```
$ find . -name "*.tar.gz" -exec tar -xzvf {} \;
$ rm *.tar.gz
```

Once the file structure looks as follows, SaintGSE is ready to use.

```
.
├── datasets
│   ├── AE_100cycle_model.pth
│   ├── AE_enrichment.tsv
│   ├── AEshap
│   │   ├── shap_values_latent_dim_1.csv
│   │   ├── shap_values_latent_dim_2.csv
│   │   ├── shap_values_latent_dim_3.csv
│   │   ├── ...
│   │   ├── ...
│   │   └── shap_values_latent_dim_256.csv
│   ├── bestmodels
│   │   └── binary
│   │       ├── Chronic_Myeloid_Leukemia
│   │       │   └── testrun
│   │       │       └── saint_gse_model.pth
│   │       ├── ...
│   │       └── Selencompound_Biosynthesis
│   │           └── testrun
│   │               └── saint_gse_model.pth
│   ├── gene_list.pkl
│   ├── MGI_Gene_Model_Coord.tsv
│   └── pathway_list_in_DEG.txt
```

The code in this dataset is also available on GitHub: https://github.com/MSjeon27/SaintGSE

DEG dataset preparation

Before running SaintGSE, prepare the input DEG data in .tsv format. The columns are the official gene symbols of the DEGs, and each row holds the log2 fold change values for one DEG group. For example:

```
                        LAP3    CD99    HS3ST1  MAD1L1  LASP1   SNX11
'mock-6' vs 'LPS-6'     -1.3    0       2.4     0       0.7     0
'mock-6' vs 'EBOV-6'    -1.3    0       2.3     0       0.6     0
```

Usage

Step 0. Preprocessing the input DEG (from pyDESeq2 result)

SaintGSE can currently convert mouse genes into human genes. The preprocessing script converts human or mouse DEG data into the format used by SaintGSE.

* human DEGs

```
$ preprocessing.py --query_fc /path/to/your/DEGs.tsv --out Preprocessed_fc.tsv
```

* mouse DEGs

```
$ preprocessing.py --query_fc /path/to/your/DEGs.tsv --org mouse --out Preprocessed_fc.tsv
```

Step 1. Training SaintGSE for a target pathway

SaintGSE can be used to analyze new gene expression datasets for pathway prediction:

```
$ SaintGSE.py --pathway 'Proteins Involved in Osteoarthritis' --pretrain
```

Step 2. Prediction through SaintGSE

```
$ SaintGSE.py --predict Preprocessed_fc.tsv --pathway 'Proteins Involved in Osteoarthritis'
```

The prediction result looks as follows.

```
tensor([[1.]], device='cuda:0')
```

This indicates that your DEG data is related to the target signaling pathway.

Step 3. Interpreting the result of SaintGSE (get the relative SHAP contribution of each DEG)

```
$ Interpret.py -d Preprocessed_fc.tsv -p 'Proteins Involved in Osteoarthritis'
```

Interpretation produces the following file for each sample.
```
__significant_gene_shap.csv
```

This file contains the relative SHAP contribution of each gene in the DEG data. For subsequent analysis, we recommend focusing on the genes whose relative SHAP contributions fall in the top 35% to 50%, depending on the number of DEGs, as suggested in the paper.

How to Cite

If you use this model or repository in your research, please cite it as follows:

```
Jeon, MS & Nam, JH et al., "SaintGSE: Transformer-based efficient and explainable gene set enrichment analysis," 2024. GitHub repository. Available at: https://github.com/MSjeon27/SaintGSE
```

For more information or any questions regarding citation, feel free to contact us (msjeon27@cau.ac.kr).
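The table layout described under DEG dataset preparation (gene symbols as columns, one row of log2 fold changes per DEG group) can be produced with a few lines of Python. The sketch below is only an illustration and is not part of SaintGSE: the helper name `write_deg_table` is hypothetical, and two details are assumptions rather than documented behaviour, namely the empty first header cell above the row labels and writing 0 for a gene that is missing from a comparison. The values are the ones from the example table.

```python
import csv
import io


def write_deg_table(handle, comparisons):
    """Write log2 fold changes as a tab-separated table.

    `comparisons` maps a DEG group label to {gene_symbol: log2_fold_change}.
    Columns are gene symbols and rows are DEG groups.
    Returns the list of gene symbols in column order.
    """
    # Collect gene symbols in first-seen order so the column order is stable.
    genes = []
    for log2fc in comparisons.values():
        for gene in log2fc:
            if gene not in genes:
                genes.append(gene)

    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Assumption: the header starts with an empty cell above the row labels.
    writer.writerow([""] + genes)
    for label, log2fc in comparisons.items():
        # Assumption: a gene absent from a comparison is written as 0.
        writer.writerow([label] + [log2fc.get(gene, 0) for gene in genes])
    return genes


# Values copied from the example table in the DEG dataset preparation section.
comparisons = {
    "'mock-6' vs 'LPS-6'": {
        "LAP3": -1.3, "CD99": 0, "HS3ST1": 2.4,
        "MAD1L1": 0, "LASP1": 0.7, "SNX11": 0,
    },
    "'mock-6' vs 'EBOV-6'": {
        "LAP3": -1.3, "CD99": 0, "HS3ST1": 2.3,
        "MAD1L1": 0, "LASP1": 0.6, "SNX11": 0,
    },
}

buffer = io.StringIO()
genes = write_deg_table(buffer, comparisons)
table_text = buffer.getvalue()
print(table_text)
```

Passing a file opened with `open("DEGs.tsv", "w", newline="")` instead of the in-memory buffer gives a file that could be handed to `preprocessing.py --query_fc` in Step 0. The exact header layout that script expects is not stated here, so check the output against your own data first.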

Created:

2024-11-26
