ARTDeco Output
DataCite Commons | Updated 2023-12-18 | Indexed 2024-08-18
Download link:
https://figshare.com/articles/dataset/ARTDeco_Output/24848265/1
Description:
<b>Transcriptome profiles from healthy human tissues</b>
RNA samples (BAM files) were accessed on 2021/04/01 from the Genotype-Tissue Expression (GTEx; release v8) project, which is hosted in the NCBI database of Genotypes and Phenotypes (dbGaP) <sup>17–19</sup>. Access was granted under dbGaP accession phs000424.v8.p2, to which the NIH Genomic Data Sharing Policy applies to protect donor privacy (all information is anonymized). The GTEx platform includes approximately 948 postmortem donors, from whom RNA samples from several tissues were isolated on an ongoing basis as donors were enrolled in the study. We considered only paired-end samples with at least 60 million reads per sample that were prepared with the Illumina TruSeq library construction protocol (non-strand-specific, polyA+ selected library). Cell culture samples and tissues with fewer than 50 samples were excluded. Healthy subjects were selected by filtering samples for "violent and fast deaths" and "no terminal diseases". This yielded 2778 samples from 23 healthy human tissues, which were used for downstream analyses.
<b>Transcription readthrough detection</b>
To detect transcription readthrough (TRT), we first converted the BAM files downloaded from dbGaP back to FASTQ using samtools (v1.10) <sup>44</sup>, and then re-aligned the reads to the reference genome (GRCh38 assembly; release 37, GRCh38.p13) using STAR (v2.7.8a) <sup>45</sup>. We then ran ARTDeco <sup>20</sup>, a pipeline for analyzing and characterizing transcriptional readthrough that uses a rolling-window approach to search for continuous coverage over a minimal length downstream of the 3' end of each gene locus (annotation version 37, Ensembl 103). A window is considered part of the readthrough tail only if its transcription level meets the thresholds. We used a rolling window of 500 bp, a minimum length of 2000 bp, and a minimum coverage of 0.15 FPKM.
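The rolling-window criterion described above can be illustrated with a short Python sketch. This is not ARTDeco's implementation: the function names are hypothetical, the input is assumed to be a list of per-base coverage values downstream of a gene's 3' end, already expressed in normalized units comparable to the 0.15 threshold, and windows are assumed to advance in non-overlapping steps.

```python
def readthrough_tail_length(coverage, window=500, min_coverage=0.15):
    """Length (bp) of the contiguous downstream region, starting at the
    3' end, in which every window has mean coverage >= min_coverage.

    `coverage` is a hypothetical per-base coverage list (index 0 is the
    first base downstream of the gene). Simplified illustration only.
    """
    length = 0
    # Step through the downstream region one full window at a time.
    for start in range(0, len(coverage) - window + 1, window):
        win = coverage[start:start + window]
        if sum(win) / window < min_coverage:
            break  # the tail ends at the first window below threshold
        length = start + window
    return length


def has_readthrough(coverage, window=500, min_length=2000, min_coverage=0.15):
    """True if the covered tail is at least `min_length` bp long."""
    return readthrough_tail_length(coverage, window, min_coverage) >= min_length


# Toy example: 2500 bp of coverage followed by 1000 bp with none.
covered = [1.0] * 2500 + [0.0] * 1000
short_tail = [1.0] * 1000 + [0.0] * 2000
print(readthrough_tail_length(covered), has_readthrough(covered))
print(readthrough_tail_length(short_tail), has_readthrough(short_tail))
```

With the defaults, a gene is flagged only when at least four consecutive 500 bp windows pass the coverage threshold, which is how the 500 bp window and 2000 bp minimum length interact in this simplified reading.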
ARTDeco uses HOMER's tools <sup>46</sup> to select only uniquely mapped reads for downstream analysis and returns a variety of metrics to measure readthrough. For downstream analysis, we used the information contained in the "quantification" and "dogs" folders (expression levels and novel transcripts created as a result of readthrough, respectively).
Because GTEx samples were profiled using non-stranded RNA-seq libraries, a significant number of reads identified as downstream transcripts actually came from genes expressed in the opposite direction. Since transcriptional signal can come from either strand, ARTDeco cannot always infer a true downstream transcript unambiguously. To eliminate these dubious cases created by the lack of strandedness (designated as undefined genes), we filtered the ARTDeco output to report only entries that did not overlap with genes on the opposite strand, using the intersect function from bedtools (v2.30.0) <sup>47</sup>. This approach discards readthrough (RT) transcripts with close downstream neighbors on the opposite strand but ensures that our list of readthrough genes is robust. In addition, only RT transcripts from genes expressed in a given tissue were considered for downstream analysis. Expressed genes were defined as those with FPKM > 1 in at least 25% of the samples of that tissue.
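The two filters described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: intervals are hypothetical `(chrom, start, end, strand)` tuples in BED-style half-open coordinates, and the function names are invented for this example. The strand filter mimics what `bedtools intersect` reports when asked for entries with no opposite-strand overlap (its `-v` and `-S` options); the exact options the authors used are not stated in the description.

```python
def overlaps_opposite_strand(entry, genes):
    """True if `entry` overlaps any gene annotated on the other strand."""
    chrom, start, end, strand = entry
    return any(
        g_chrom == chrom
        and g_strand != strand
        and g_start < end      # half-open interval overlap test
        and start < g_end
        for g_chrom, g_start, g_end, g_strand in genes
    )


def filter_readthrough(entries, genes):
    """Keep only readthrough entries with no opposite-strand gene overlap."""
    return [e for e in entries if not overlaps_opposite_strand(e, genes)]


def is_expressed(fpkm_values, min_fpkm=1.0, min_fraction=0.25):
    """True if FPKM > min_fpkm in at least min_fraction of the samples."""
    if not fpkm_values:
        return False
    n_above = sum(1 for v in fpkm_values if v > min_fpkm)
    return n_above / len(fpkm_values) >= min_fraction


# Toy data: the first entry overlaps a minus-strand gene and is dropped;
# the second overlaps only a same-strand gene and is kept.
entries = [("chr1", 100, 200, "+"), ("chr1", 500, 600, "+")]
genes = [("chr1", 150, 300, "-"), ("chr1", 550, 700, "+")]
print(filter_readthrough(entries, genes))

# One of four samples exceeds FPKM 1, which is exactly 25%.
print(is_expressed([0.5, 2.0, 0.1, 0.2]))
```

The linear scan over all genes is only suitable for toy inputs; for genome-scale data, an interval-based tool such as bedtools is the practical choice, as in the original analysis.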
Provider:
figshare
Created:
2023-12-18



