Enhanced Protein Isoform Characterization Through Long-Read Proteogenomics - Workflow Results

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/5920919

Description:

The detection of physiologically relevant protein isoforms encoded by the human genome is critical to biomedicine. Mass spectrometry (MS)-based proteomics is the preeminent method for protein detection, but isoform-resolved proteomic analysis relies on accurate reference databases that match the sample; neither a subset nor a superset database is ideal. Long-read RNA sequencing (e.g. PacBio, Oxford Nanopore) provides full-length transcript sequencing, which can be used to predict full-length proteins. Here, we describe a long-read proteogenomics approach for integrating matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. We introduce a classification scheme for protein isoforms, discover novel protein isoforms, and present the first protein inference algorithm for the direct incorporation of long-read transcriptome data in protein inference to enable detection of protein isoforms that are intractable to MS detection. We have released an open-source Nextflow pipeline that integrates long-read sequencing in a proteomic workflow for isoform-resolved analysis.

Companion Repositories:
- Long-Read-Proteogenomics Workflow GitHub Repository Release
- Long-Read-Proteogenomics Analysis GitHub Repository Release

Companion Datasets:
- Long-Read-Proteogenomics Workflow Sample and Reference Data
- TEST Data for Long-Read-Proteogenomics Workflow GitHub Actions

This repository contains the complete output from the execution of the Long-Read-Proteogenomics Workflow, using the input from Jurkat Samples and Reference Data.

The file jurkat.flnc.bam was 6.5 GB and had to be split into 13 separate files; rejoin them before use. These are the steps that were used to split the file:

1. Convert jurkat.flnc.bam (binary format) to a SAM file (text format) without the header:
   samtools view jurkat.flnc.bam > jurkat.flnc.sam
2. Capture the header:
   samtools view -H jurkat.flnc.bam > jurkat.flnc.header.sam
3. Split jurkat.flnc.sam into smaller files (aiming for a final size under 2 GB):
   split -l 400000 jurkat.flnc.sam jurkat.flnc.chunk.
4. Convert each of these files back to BAM for uploading:
   samtools view -b jurkat.flnc.chunk.a* -o jurkat.flnc.chunk.a*.bam (*=a,b,c,d,e,f,g,h,i,j,k,l,m)

After downloading, reverse this process, using the header file found in LRPG-Manuscript-Results-results-results-jurkat-isoseq3-companion-files.tar.gz:

1. Convert the BAM files back to SAM files:
   samtools view jurkat.flnc.chunk.a*.bam > jurkat.flnc.chunk.a*.sam (*=a,b,c,d,e,f,g,h,i,j,k,l,m)
2. Concatenate the SAM files (the header is applied in step 4):
   cat jurkat.flnc.chunk.a*sam > jurkat.flnc.sam
   The combined file was verified to have the same number of lines as the original without the header: 4,956,761. The header file is 13 lines.
3. Convert to BAM if desired:
   samtools view -b jurkat.flnc.sam -o jurkat.flnc.bam
4. Rehead with the header file:
   samtools reheader -P -i jurkat.flnc.header.sam jurkat.flnc.bam
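The split-and-rejoin logic above can be rehearsed on a small synthetic file before touching the real data. The following is a minimal sketch, not part of the published workflow: the file names demo.sam and demo.chunk. are hypothetical stand-ins for jurkat.flnc.sam and its chunks, and the chunk size is scaled down to 400 lines. It uses only coreutils (no samtools), checks that concatenating the chunks reproduces the original byte for byte, and confirms that 4,956,761 lines at 400,000 lines per chunk yields the 13 files mentioned in the description.

```shell
#!/bin/sh
# Sketch only: round-trip check of `split -l` followed by `cat` on synthetic data.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for the headerless SAM file (1000 lines instead of 4,956,761).
seq 1 1000 > demo.sam

# Split into 400-line chunks; split names them demo.chunk.aa, demo.chunk.ab, ...
split -l 400 demo.sam demo.chunk.

# Rejoin: the glob expands in sorted order, which matches split's suffix order.
cat demo.chunk.a? > demo.rejoined.sam

# Line counts before and after (tr strips the padding some wc versions add).
original_lines=$(wc -l < demo.sam | tr -d ' ')
rejoined_lines=$(wc -l < demo.rejoined.sam | tr -d ' ')
chunk_count=$(ls demo.chunk.a? | wc -l | tr -d ' ')

# Ceiling division: number of chunks expected for the real file sizes in the description.
expected_chunks=$(( (4956761 + 400000 - 1) / 400000 ))

echo "chunks: $chunk_count, original: $original_lines, rejoined: $rejoined_lines"
echo "expected chunks for the real file: $expected_chunks"

# Byte-for-byte comparison is a stricter check than the line count alone.
cmp demo.sam demo.rejoined.sam && echo "round trip OK"
```

Comparing line counts, as the description does, catches a missing chunk; the cmp step additionally catches chunks joined in the wrong order.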

Created:

2024-07-17
