Integrated Proteomic Pipeline Using Multiple Search Engines for a Proteogenomic Study with a Controlled Protein False Discovery Rate

Source: NIAID Data Ecosystem (indexed 2026-03-09)

Download link:

https://figshare.com/articles/dataset/Integrated_Proteomic_Pipeline_Using_Multiple_Search_Engines_for_a_Proteogenomic_Study_with_a_Controlled_Protein_False_Discovery_Rate/3793389

Resource description:

In the Chromosome-Centric Human Proteome Project (C-HPP), false-positive identifications from peptide-spectrum matches (PSMs) after database searching are a major issue for proteogenomic studies that use large-scale proteomic profiling based on liquid chromatography and mass spectrometry. Here we developed a simple strategy for protein identification, with a controlled false discovery rate (FDR) at the protein level, using an integrated proteomic pipeline (IPP) that consists of four steps. First, individual proteomic searches were performed against the neXtProt database using three different search engines: SEQUEST, MASCOT, and MS-GF+. Second, the PSM search results were combined using statistical evaluation tools including DTASelect and Percolator. Third, the peptide search scores were converted into normalized E-scores using an in-house program. Last, ProteinInferencer was used to select proteins containing two or more peptides with a controlled FDR of 1.0% at the protein level. We then compared the performance of the IPP with that of a conventional proteomic pipeline (CPP) for protein identification using a controlled FDR of <1% at the protein level. Using the IPP, a total of 5756 proteins (vs 4453 using the CPP), including 477 alternative splicing variants (vs 182 using the CPP), were identified from human hippocampal tissue. In addition, a total of 10 missing proteins (vs 7 using the CPP) were identified with two or more unique peptides, and their tryptic peptides were validated using MS/MS spectral patterns from a repository database or their corresponding synthetic peptides. This study shows that the IPP effectively improved the identification of proteins, including alternative splicing variants and missing proteins, in human hippocampal tissues for the C-HPP. All RAW files used in this study were deposited in ProteomeXchange (PXD000395).

Created:

2016-10-31
