Identification of Protein Isoforms Using Reference Databases Built from Long and Short Read RNA-Sequencing

Source: NIAID Data Ecosystem (indexed 2026-03-13)

Download link:

https://figshare.com/articles/dataset/Identification_of_Protein_Isoforms_Using_Reference_Databases_Built_from_Long_and_Short_Read_RNA-Sequencing/19875163

Resource description:

Alternative splicing can lead to distinct protein isoforms. These can have different functions in specific cells and tissues or in different developmental stages. In this study, we explored whether transcripts assembled from long read, nanopore-based, direct RNA-sequencing (RNA-seq) could improve the identification of protein isoforms in human K562 cells. By comparing with Illumina-based short read RNA-seq, we showed that a large proportion of Ensembl transcripts (5949/14,326) and genes expressing alternatively spliced transcripts (486/2981) identified with long direct reads were missed by short paired-end reads. By co-analyzing proteomic and transcriptomic data, we also showed that some peptides (826/35,976), proteins (262/3215), and protein isoforms arising from distinct transcript variants (574/1212) identified with isoform-specific peptides via custom long-read-based databases were missed in Illumina-derived databases. Finally, we generated unequivocal peptide evidence for a set of protein isoforms and showed that long read, direct RNA-seq allows the discovery of novel protein isoforms not already in reference databases or custom databases built from short read RNA-seq data. Our analysis highlights the benefits of long read RNA-seq data in the generation of reference databases to increase tandem mass spectrometry (MS/MS) identification of protein isoforms.
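To put the reported counts in proportion, the following minimal sketch converts each ratio quoted in the description into a percentage. The counts are copied from the text above; the function name `percent_missed` and the labels are illustrative assumptions and are not part of the dataset or the study's code.

```python
# Counts quoted in the description: (missed by short-read data, total identified
# with long direct reads or long-read-based databases).
REPORTED = {
    "Ensembl transcripts": (5949, 14326),
    "Genes with alternatively spliced transcripts": (486, 2981),
    "Peptides": (826, 35976),
    "Proteins": (262, 3215),
    "Protein isoforms from distinct transcript variants": (574, 1212),
}


def percent_missed(missed: int, total: int) -> float:
    """Return the share of `total` that was missed, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * missed / total


if __name__ == "__main__":
    for label, (missed, total) in REPORTED.items():
        print(f"{label}: {missed}/{total} = {percent_missed(missed, total):.1f}%")
```

On these figures, roughly two in five transcripts and nearly half of the isoform-level identifications were missed by the short-read data, while the peptide-level share is only a few percent.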

Created:

2022-05-25
