Ndungu.etl_GTEx SQLiteDBs
Source: DataCite Commons | Updated: 2020-08-26 | Indexed: 2024-07-27
Download link:
https://figshare.com/articles/Ndungu_etl_GTEx_SQLiteDBs/10324055/1
Description:
A multi-tissue transcriptome analysis of human metabolites guides the interpretability of associations based on multi-SNP models for gene expression

This file contains gene expression prediction models for 43 tissues from GTEx v7, used to perform TWAS across 46 metabolites in Ndungu et al. (2019).

For each tissue, LASSO regression was used to select an optimal set of SNPs with non-zero effects on gene expression. Regression was performed using GLMNET in R on each gene, with all SNPs less than 1 Mb from any part of the gene as potential covariates. To select the optimal penalty factor for each gene, mean squared error (MSE) was calculated using 10-fold cross-validation across 100 automatically selected candidate penalty factors.

For genes with multiple SNPs selected by LASSO regression, all selected SNPs were first linearly modelled against the gene’s expression. For any group of SNPs in perfect LD, one was randomly selected and retained. Model R2 was calculated for the full linear model. Then, starting with the SNP with the lowest p-value in the model, SNPs were added back one at a time, each time calculating the subset model’s R2 (i.e. forward regression). Once 95% of the full model’s R2 value was attained, any SNPs not in the current subset model were eliminated. The final subset of SNPs was then modelled against expression and smoothed using ridge regression to minimize overfitting, with penalty factors selected using 25 iterations of 10-fold cross-validated ridge regression.

For genes with only one SNP selected by LASSO, this SNP alone was modelled against gene expression using 25 iterations of 10-fold cross-validated ridge regression. The final coefficients from the ridge regression models were carried forward for use in S-PrediXcan.

Each tissue’s prediction model is an SQLite database with two tables. The schemas for the tables are as follows:

extra – holds data about each linear model for predicting the transcriptome in the tissue. The column names with descriptions are listed here:
o gene – The Ensembl ID of the gene
o genename – The gene’s HUGO symbol
o pred.perf.R2 – The cross-validated R2 value found when training the model
o n.snps.in.model – The number of cis-SNPs used to predict the expression level of the gene
o pred.perf.pval – The p-value of the correlation between cross-validated prediction and observed expression
o pred.perf.qval – The q-value obtained when analyzing the initial distribution of p-values

weights – holds the weights for the SNPs in the linear models. The column names with descriptions are listed here:
o rsid – The rsID of the SNP from dbSNP build 142
o gene – The Ensembl ID of the gene for which the SNP weight is predicting expression
o weight – The weight value for the SNP in the model
o ref_allele – The other (non-effect, non-dosage) allele of the SNP
o eff_allele – The effect (dosage) allele of the SNP

Reference: Ndungu A. et al. (2019). A multi-tissue transcriptome analysis of human metabolites guides the interpretability of associations based on multi-SNP models for gene expression.

Please refer any queries to: Mark McCarthy (mccarthy.mark@gene.com)
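
As an illustration of how the `extra` and `weights` tables described above might be queried, here is a minimal Python sketch using the standard-library `sqlite3` module. It is not part of the original deposit. The helper names (`make_demo_db`, `model_summary`, `predict_expression`) and every value in the demo database are hypothetical; only the table and column names come from the description. Column names containing dots are double-quoted so that SQLite treats them as identifiers. The weighted-sum helper only shows how the weights define a linear predictor from individual-level dosages; S-PrediXcan itself works from GWAS summary statistics.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with the documented schema.

    All rows are made-up placeholders, not values from the real files.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE extra (gene TEXT, genename TEXT, "pred.perf.R2" REAL, '
        '"n.snps.in.model" INTEGER, "pred.perf.pval" REAL, "pred.perf.qval" REAL)'
    )
    conn.execute(
        "CREATE TABLE weights (rsid TEXT, gene TEXT, weight REAL, "
        "ref_allele TEXT, eff_allele TEXT)"
    )
    conn.execute(
        "INSERT INTO extra VALUES (?, ?, ?, ?, ?, ?)",
        ("ENSG_DEMO", "DEMO1", 0.12, 2, 1e-5, 1e-4),
    )
    conn.executemany(
        "INSERT INTO weights VALUES (?, ?, ?, ?, ?)",
        [
            ("rs_demo1", "ENSG_DEMO", 0.5, "A", "G"),
            ("rs_demo2", "ENSG_DEMO", -0.3, "C", "T"),
        ],
    )
    return conn


def model_summary(conn, gene):
    """Return (genename, cross-validated R2, number of SNPs) for one gene."""
    return conn.execute(
        'SELECT genename, "pred.perf.R2", "n.snps.in.model" '
        "FROM extra WHERE gene = ?",
        (gene,),
    ).fetchone()


def predict_expression(conn, gene, dosages):
    """Weighted sum of effect-allele dosages for one gene.

    `dosages` maps rsid -> effect-allele dosage (0 to 2) for one individual.
    SNPs in the model that are absent from `dosages` are skipped.
    """
    rows = conn.execute(
        "SELECT rsid, weight FROM weights WHERE gene = ?", (gene,)
    ).fetchall()
    return sum(weight * dosages[rsid] for rsid, weight in rows if rsid in dosages)


if __name__ == "__main__":
    conn = make_demo_db()
    print(model_summary(conn, "ENSG_DEMO"))
    print(predict_expression(conn, "ENSG_DEMO", {"rs_demo1": 2, "rs_demo2": 1}))
```

To run the same queries against one of the real tissue models, replace `make_demo_db()` with `sqlite3.connect(...)` pointing at the downloaded database file. The file names inside the archive are not given in this description, so the path is left to the reader.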
Provider:
figshare
Created:
2019-11-19



