Table_2_Whole Proteome Clustering of 2,307 Proteobacterial Genomes Reveals Conserved Proteins and Significant Annotation Issues.xlsx

NIAID Data Ecosystem, indexed 2026-03-11

Download link:

https://figshare.com/articles/dataset/Table_2_Whole_Proteome_Clustering_of_2_307_Proteobacterial_Genomes_Reveals_Conserved_Proteins_and_Significant_Annotation_Issues_xlsx/7781489

Description:

We clustered 8.76 M protein sequences deduced from 2,307 completely sequenced Proteobacterial genomes, resulting in 707,311 clusters of one or more sequences, of which 224,442 ranged in size from 2 to 2,894 sequences. To our knowledge, this is the first study of this scale. We were surprised to find that no single cluster contained a representative sequence from all the organisms in the study. Given the minimal genome concept, we expected to find a shared set of proteins. To determine why the clusters did not have universal representation, we chose four essential proteins, the chaperonin GroEL, DNA-dependent RNA polymerase subunits beta and beta′ (RpoB/RpoB′), and DNA polymerase I (PolA), representing fundamental cellular functions, and examined their cluster distribution. We found these proteins to be remarkably conserved, with certain caveats. Although the groEL gene was universally conserved in all the organisms in the study, the protein was not represented in all the deduced proteomes. The genes for RpoB and RpoB′ were missing from two genomes and merged in 88, and the sequences were sufficiently divergent that they formed separate clusters for 18 RpoB proteins (seven clusters) and 14 RpoB′ proteins (three clusters). For PolA, 52 organisms lacked an identifiable sequence, and seven sequences were sufficiently divergent that they formed five separate clusters. Interestingly, organisms lacking an identifiable PolA and those with divergent RpoB/RpoB′ were predominantly endosymbionts. Furthermore, we present a range of examples of annotation issues that caused the deduced proteins to be incorrectly represented in the proteome. These annotation issues made our task of determining protein conservation more difficult than expected and also represent a significant obstacle for high-throughput analyses.
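The universal-representation check described in the abstract can be illustrated with a minimal sketch. It assumes cluster membership is available as (cluster ID, genome ID, protein ID) triples. That layout, the function names, and the toy data are all hypothetical; they are not taken from the dataset or from the authors' pipeline.

```python
from collections import defaultdict


def genome_coverage(assignments):
    """Map each cluster ID to the set of genomes with at least one member in it."""
    coverage = defaultdict(set)
    for cluster_id, genome_id, _protein_id in assignments:
        coverage[cluster_id].add(genome_id)
    return dict(coverage)


def universal_clusters(coverage, all_genomes):
    """Return the sorted IDs of clusters that contain a sequence from every genome."""
    required = set(all_genomes)
    return sorted(cid for cid, genomes in coverage.items() if genomes >= required)


def missing_genomes(coverage, cluster_id, all_genomes):
    """Return the sorted genomes that have no sequence in the given cluster."""
    return sorted(set(all_genomes) - coverage.get(cluster_id, set()))


# Toy data (hypothetical): three genomes, three clusters.
assignments = [
    ("c1", "gA", "p1"),
    ("c1", "gB", "p2"),
    ("c1", "gC", "p3"),
    ("c2", "gA", "p4"),
    ("c2", "gB", "p5"),
    ("c3", "gC", "p6"),
]
genomes = ["gA", "gB", "gC"]

coverage = genome_coverage(assignments)
print(universal_clusters(coverage, genomes))     # clusters covering every genome
print(missing_genomes(coverage, "c2", genomes))  # genomes absent from cluster c2
```

Listing the genomes absent from a cluster that is expected to be universal (for example, one holding GroEL sequences) is one way to shortlist genomes for manual inspection of possible annotation issues.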

Created:

2019-02-28
