Evaluating the Impact of Different Sequence Databases on Metaproteome Analysis: Insights from a Lab-Assembled Microbial Mixture

Source: NIAID Data Ecosystem. Indexed 2026-03-08.

Download link:

https://figshare.com/articles/dataset/_Evaluating_the_Impact_of_Different_Sequence_Databases_on_Metaproteome_Analysis_Insights_from_a_Lab_Assembled_Microbial_Mixture_/871689


Description:

Metaproteomics enables the investigation of the protein repertoire expressed by complex microbial communities. However, to unleash its full potential, refinements in bioinformatic approaches for data analysis are still needed. In this context, sequence database selection represents a major challenge. This work assessed the impact of different databases in metaproteomic investigations by using a mock microbial mixture including nine diverse bacterial and eukaryotic species, which was subjected to shotgun metaproteomic analysis. Then, both the microbial mixture and the single microorganisms were subjected to next-generation sequencing to obtain experimental metagenomic- and genomic-derived databases, which were used along with public databases (namely, NCBI, UniProtKB/SwissProt and UniProtKB/TrEMBL, parsed at different taxonomic levels) to analyze the metaproteomic dataset. First, a quantitative comparison in terms of number and overlap of peptide identifications was carried out among all databases. As a result, only 35% of peptides were common to all database classes; moreover, genus/species-specific databases provided up to 17% more identifications compared to databases with generic taxonomy, while the metagenomic database enabled a slight increase over public databases. Then, database behavior in terms of false discovery rate and peptide degeneracy was critically evaluated. Public databases with generic taxonomy exhibited a markedly different trend compared to their counterparts. Finally, the reliability of taxonomic attribution according to the lowest common ancestor approach (using MEGAN and Unipept software) was assessed. The level of misassignments varied among the different databases, and specific thresholds based on the number of taxon-specific peptides were established to minimize false positives.
This study confirms that database selection has a significant impact on metaproteomics, and provides critical indications for improving the depth and reliability of metaproteomic results. Specifically, the use of iterative searches and of suitable filters for taxonomic assignments is proposed with the aim of increasing the coverage and trustworthiness of metaproteomic data.
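
The lowest common ancestor assignment and the peptide-count filter mentioned in the description can be illustrated with a minimal Python sketch. This is not the MEGAN or Unipept implementation: the function names, the toy lineages, and the threshold of 2 peptides are all hypothetical, chosen only to show the idea. The description does not state the threshold values used in the study.

```python
from collections import Counter

def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    Each lineage is a root-first list of taxon names. Returns None
    when the lineages share no taxon (or when no lineage is given).
    """
    lca = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        lca = ranks[0]
    return lca

def filter_taxa(peptide_lineages, min_peptides):
    """Assign each peptide to its LCA, then keep only taxa supported
    by at least `min_peptides` peptides (hypothetical threshold)."""
    counts = Counter()
    for lineages in peptide_lineages.values():
        taxon = lowest_common_ancestor(lineages)
        if taxon is not None:
            counts[taxon] += 1
    return {taxon: n for taxon, n in counts.items() if n >= min_peptides}

# Toy data (not from the study): peptide -> lineages of all matching proteins.
peptide_lineages = {
    "PEPTIDEA": [["Bacteria", "Proteobacteria", "Escherichia"]],
    "PEPTIDEB": [["Bacteria", "Proteobacteria", "Escherichia"]],
    "PEPTIDEC": [["Bacteria", "Proteobacteria", "Escherichia"],
                 ["Bacteria", "Proteobacteria", "Pseudomonas"]],
    "PEPTIDED": [["Bacteria", "Firmicutes", "Lactobacillus"]],
}

supported = filter_taxa(peptide_lineages, min_peptides=2)
print(supported)
```

In this toy case the peptide shared between two genera is assigned to their common phylum, and taxa supported by a single peptide are dropped by the filter, which is the kind of false-positive control the description refers to.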


Created:

2013-12-09
