Human and Mouse UTRomes
Source: Figshare. Updated 2024-04-01; indexed 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Human_and_Mouse_UTRomes/23549526
Resource description:
Overview
This dataset contains BED and GTF files representing the cleavage sites and 3'UTR isoform annotations derived from reprocessing Microwell-seq data. These objects are part of the minimum dataset required for verifying the analysis reported in Fansler et al., bioRxiv, 2023.

Description
The BED files contain candidate cleavage sites from the Mouse Cell Atlas and Human Cell Landscape datasets. In brief, overlapping paired-end reads were merged with PEAR, cell barcodes were extracted with umi_tools, poly-A tails were removed with cutadapt, and the remaining reads were mapped to the hg38 or mm10 genome using HISAT2. Reads were partitioned into cell types according to annotations from the original publications. Per cell type, the 5' ends of alignments were summarized, counts were merged to the mode within 30 nt, and sites were finally filtered to a minimum threshold of 5 TPM. The resulting BED files identify the cell type cluster in the name column and the number of observed reads in the score column.

The GTF files are augmentations of GENCODE vM25 and v39 using novel cleavage sites, with transcripts then truncated to 500 nt. In brief, the sites provided in the BED files were harmonized across cell types by merging to the mode within 30 nt. The candidate sites were then serially classified as:
(1) "validated" if already in GENCODE;
(2) "supported" if found in PolyASite2.0 at 3 TPM or higher;
(3) "likely" if cleanUpdTSeq scored the posterior probability of being an internal priming site below 0.0001%;
(4) "unlikely" otherwise.
The "supported" and "likely" candidates were then used to augment GENCODE annotations of protein-coding transcripts, and each transcript was truncated to the 500 nt at its 3' end.
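The serial classification of candidate sites can be illustrated with a minimal Python sketch. It is not the authors' implementation (that lives in the repositories listed under Data Generation). The `CandidateSite` record, its field names, and the helper functions are hypothetical. The sketch also assumes that "already in GENCODE" has been reduced to a boolean beforehand, and that the cleanUpdTSeq posterior is stored as a probability between 0 and 1, so the 0.0001% cutoff is converted accordingly.

```python
from dataclasses import dataclass

# Thresholds quoted in the dataset description.
POLYASITE_MIN_TPM = 3.0
MAX_INTERNAL_PRIMING_PROB = 0.0001 / 100  # "0.0001%" expressed as a probability


@dataclass
class CandidateSite:
    """Hypothetical record for one harmonized cleavage site."""
    in_gencode: bool              # assumed precomputed match to a GENCODE transcript end
    polyasite_tpm: float          # PolyASite 2.0 expression; 0.0 if the site is absent
    internal_priming_prob: float  # cleanUpdTSeq posterior, as a probability in [0, 1]


def classify_site(site: CandidateSite) -> str:
    """Apply the four labels in order; the first rule that matches wins."""
    if site.in_gencode:
        return "validated"
    if site.polyasite_tpm >= POLYASITE_MIN_TPM:
        return "supported"
    if site.internal_priming_prob < MAX_INTERNAL_PRIMING_PROB:
        return "likely"
    return "unlikely"


def augments_annotation(label: str) -> bool:
    """Only "supported" and "likely" sites contribute new transcript ends."""
    return label in {"supported", "likely"}
```

Because the rules are applied in order, a site that is already in GENCODE is labeled "validated" even if it would also pass the PolyASite or cleanUpdTSeq criteria.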
The final annotations identify the regions where the scUTRquant pipeline will quantify scRNA-seq data.

Data Generation
All code required to generate these files is available at:
https://github.com/Mayrlab/mca-utrome (https://doi.org/10.5281/zenodo.8118416)
https://github.com/Mayrlab/hcl-utrome (https://doi.org/10.5281/zenodo.8118411)
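To make the 500 nt truncation mentioned in the description concrete, the following Python sketch keeps only the exonic bases covering the last 500 nt of a transcript. It is an illustration under stated assumptions, not the code used to build these files. The function name `truncate_to_three_prime` is hypothetical, exons are given as 0-based half-open intervals (BED-style, whereas GTF is 1-based inclusive), and the 500 nt are assumed to be counted along the spliced transcript, which the description does not specify.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # 0-based, half-open (start, end) genomic interval


def truncate_to_three_prime(exons: List[Exon], strand: str, max_len: int = 500) -> List[Exon]:
    """Keep the exonic bases covering the last `max_len` nt at the transcript's 3' end."""
    if strand not in {"+", "-"}:
        raise ValueError("strand must be '+' or '-'")
    # Walk exons from the 3' end: highest coordinates on '+', lowest on '-'.
    ordered = sorted(exons, reverse=(strand == "+"))
    kept: List[Exon] = []
    remaining = max_len
    for start, end in ordered:
        if remaining <= 0:
            break
        take = min(end - start, remaining)
        if strand == "+":
            kept.append((end - take, end))      # keep the right-hand part of the exon
        else:
            kept.append((start, start + take))  # keep the left-hand part of the exon
        remaining -= take
    return sorted(kept)  # return in ascending genomic order
```

Transcripts shorter than `max_len` are returned unchanged, and on the minus strand the retained bases come from the lowest genomic coordinates, since that is where the 3' end lies.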
Created:
2024-04-01



