MOCCASIN: A method for correcting for known and unknown confounders in RNA splicing analysis
Source: NIAID Data Ecosystem, indexed 2026-03-12
Download link:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE162664
Description:
While the effects of confounders on gene expression analysis have been extensively studied, equivalent analyses and tools for RNA splicing analysis are lacking. Here we assess the effect of confounders in two large public RNA-Seq datasets (TARGET, ENCODE), develop a new method, MOCCASIN, to correct the effect of both known and unknown confounders on RNA splicing quantification, and demonstrate MOCCASIN's effectiveness on both synthetic and real data.

RNA-Seq samples were simulated based on transcript expression profiles derived from real mouse aorta (N=8) and cerebellum (N=8) samples from GSE54651. Sixteen simulated samples were based on the ground truth transcript expression profiles. The remaining simulated RNA-Seq samples were based on transcript profiles with batch effects injected at three effect sizes (0.02, 0.05, and 0.6), with the batch effects introduced to transcripts from three different proportions of genes (0.02, 0.05, and 0.2).

Data Processing:
1. FASTQ files from the original RNA-Seq samples (GSE54652) were aligned to GENCODE M21 with STAR.
2. Transcript-level TPMs were quantified with Salmon.
3. Batch effects were introduced to the mouse aorta and cerebellum sample transcript-level TPMs.
   The aorta samples used were SRR1158521, SRR1158522, SRR1158523, SRR1158524, SRR1158525, SRR1158526, SRR1158527, SRR1158528, and the cerebellum samples used were SRR1158545, SRR1158546, SRR1158547, SRR1158548, SRR1158549, SRR1158550, SRR1158551, SRR1158552. The first four aorta samples (SRR1158521 to SRR1158524) and the first four cerebellum samples (SRR1158545 to SRR1158548) were defined as "batch 1." The last four aorta samples (SRR1158525 to SRR1158528) and the last four cerebellum samples (SRR1158549 to SRR1158552) were defined as "batch 2."
   Batch effect perturbations were only ever introduced to the batch 2 samples, and batch effects were restricted to genes with at least two protein coding transcripts and at least one transcript with > 10 reads per kilobase of transcript length in every sample.
   The procedure to introduce batch effects is as follows. First, the most abundant protein coding transcript per gene was identified as the transcript having the maximum, over all transcripts, of the minimum reads per kilobase over all samples. This definition ensures the selected transcript is not zero in any sample. Then, for a given gene, a batch effect was introduced by (1) selecting a transcript uniformly at random (excluding the most abundant transcript) and (2) reducing the TPM of the most abundant transcript by a factor of "C% Change in Isoform TPM" and correspondingly increasing the TPM of the randomly selected transcript, thus maintaining the overall TPM of the gene and not breaking the definition of TPM (the sum of all TPMs in a sample is one million).
   The "C% Change in Isoform TPM" factors included 2%, 10%, and 60%. In addition to introducing three different levels of percent change in isoform TPMs, we also varied the percent of genes batch-effected. We introduced batch effects to G=0%, 2%, 5%, or 20% of all genes with protein coding transcripts.
4. Simulated FASTQ files were generated with the BEERS algorithm (Grant et al., 2011).
5. Salmon transcript-level TPMs and simulated FASTQ files are linked below as supplementary files.
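The batch-effect injection described in step 3 can be illustrated for a single gene with the minimal Python sketch below. It is not the authors' code. The function names, the dictionary layout (transcript ID mapped to per-sample values), and the toy numbers are hypothetical. The sketch assumes that the inputs already contain only the protein coding transcripts of one gene, and that "reducing by a factor of C%" means moving C% of the most abundant transcript's TPM to the randomly selected transcript.

```python
import random

def most_abundant_transcript(rpk):
    """Return the transcript whose minimum reads-per-kilobase (RPK)
    across all samples is largest (max over transcripts of min over samples)."""
    return max(rpk, key=lambda t: min(rpk[t]))

def is_eligible(rpk, min_rpk=10.0):
    """A gene qualifies if it has at least two transcripts and at least one
    transcript with RPK > min_rpk in every sample."""
    return len(rpk) >= 2 and any(min(v) > min_rpk for v in rpk.values())

def inject_batch_effect(tpm, rpk, batch2, change, rng):
    """Shift a fraction `change` of the most abundant transcript's TPM to one
    randomly chosen other transcript, in batch 2 samples only.
    Returns a new dict; the input `tpm` is left unchanged."""
    major = most_abundant_transcript(rpk)
    # One target transcript is drawn per gene and reused for every batch 2 sample.
    target = rng.choice(sorted(t for t in tpm if t != major))
    out = {t: list(v) for t, v in tpm.items()}
    for s in batch2:
        shift = change * out[major][s]
        out[major][s] -= shift   # reduce the most abundant isoform
        out[target][s] += shift  # increase the selected isoform by the same amount
    return out

# Toy gene with three transcripts and four samples (all values hypothetical).
rpk = {"tx_a": [50.0, 40.0, 60.0, 45.0],
       "tx_b": [5.0, 0.0, 8.0, 3.0],
       "tx_c": [12.0, 15.0, 9.0, 11.0]}
tpm = {"tx_a": [100.0, 80.0, 120.0, 90.0],
       "tx_b": [10.0, 0.0, 16.0, 6.0],
       "tx_c": [24.0, 30.0, 18.0, 22.0]}
batch2 = [2, 3]  # indices of the batch 2 samples

eligible = is_eligible(rpk)
perturbed = inject_batch_effect(tpm, rpk, batch2, change=0.10, rng=random.Random(0))
```

Because the amount removed from one transcript is added to another transcript of the same gene, the gene-level TPM in each sample, and therefore the per-sample total of one million, stays the same. Batch 1 samples are left untouched.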
Created:
2021-06-15



