Cross-Over between Discrete and Continuous Protein Structure Space: Insights into Automatic Classification and Networks of Protein Structures

NIAID Data Ecosystem, indexed 2026-03-06

Download link:

https://figshare.com/articles/dataset/Cross_Over_between_Discrete_and_Continuous_Protein_Structure_Space_Insights_into_Automatic_Classification_and_Networks_of_Protein_Structures/148176

Description:

Structural classifications of proteins assume the existence of the fold, an intrinsic equivalence class of protein domains. Here, we test under which conditions such an equivalence class is compatible with objective similarity measures. We base our analysis on the transitive property of the equivalence relationship, which requires that similarity of A with B and of B with C implies that A and C are also similar. Divergent gene evolution leads us to expect that the transitive property should approximately hold. However, if protein domains are a combination of recurrent short polypeptide fragments, as proposed by several authors, then similarity of partial fragments may violate the transitive property, favouring the continuous view of the protein structure space. We propose a measure to quantify the violations of the transitive property when a clustering algorithm joins elements into clusters, and we find that such violations present a well-defined and detectable cross-over point, from an approximately transitive regime at high structure similarity to a regime with large transitivity violations and large differences in length at low similarity. We argue that protein structure space is discrete and hierarchical classification is justified up to this cross-over point, whereas at lower similarities the structure space is continuous and should be represented as a network. We have tested the qualitative behaviour of this measure, varying all the choices involved in the automatic classification procedure, i.e., domain decomposition, alignment algorithm, similarity score, and clustering algorithm, and we have found that this behaviour is quite robust. The final classification depends on the chosen algorithms. We used the values of the clustering coefficient and the transitivity violations to select the optimal choices among those that we tested. Interestingly, this criterion also favours the agreement between automatic and expert classifications.
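The transitivity test described above can be illustrated with a small sketch. This is not the measure proposed by the authors, whose exact definition is not given in this description. It is a simplified stand-in that assumes a symmetric similarity matrix and a single similarity threshold, and it counts how often "A similar to B" and "B similar to C" fails to imply "A similar to C". The function name `transitivity_violation_rate` and the toy matrix are hypothetical.

```python
from itertools import combinations


def transitivity_violation_rate(sim, threshold):
    """Fraction of linked triples (A-B and B-C similar) in which A-C is not similar.

    `sim` is a symmetric matrix (list of lists) of pairwise similarities.
    This is an illustrative stand-in, not the measure defined by the authors.
    """
    n = len(sim)
    linked = 0    # triples in which B is similar to both A and C
    violated = 0  # linked triples in which A and C are not similar
    for b in range(n):
        neighbours = [a for a in range(n) if a != b and sim[a][b] >= threshold]
        for a, c in combinations(neighbours, 2):
            linked += 1
            if sim[a][c] < threshold:
                violated += 1
    return violated / linked if linked else 0.0


# Toy data (hypothetical): domains 0 and 2 each resemble domain 1,
# for example through different shared fragments, but not each other.
toy_sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.9],
    [0.2, 0.9, 1.0],
]
rate = transitivity_violation_rate(toy_sim, threshold=0.5)
```

Scanning the threshold from high to low similarity and looking for the point where this rate starts to grow is one simple way to picture the cross-over between the transitive and non-transitive regimes.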
As a domain set, we selected a consensus set of 2,890 domains that are decomposed very similarly in SCOP and CATH. As an alignment algorithm, we used a global version of MAMMOTH developed in our group, which is both rapid and accurate. As a similarity measure, we used the size-normalized contact overlap, and as a clustering algorithm, we used average linkage. The resulting automatic classification at the cross-over point was more consistent with the structure similarity measure than the expert classifications, with 86% of the clusters corresponding to subsets of either SCOP or CATH superfamilies and fewer than 5% containing domains in distinct folds according to both SCOP and CATH. Almost 15% of SCOP superfamilies and 10% of CATH superfamilies were split, consistent with the notion of fold change in protein evolution. These results were qualitatively robust for all choices that we tested, although we did not try alignment algorithms developed by other groups. Folds defined in SCOP and CATH would be completely joined in the regime of large transitivity violations, where clustering is more arbitrary. Consistent with this, the agreement between SCOP and CATH at the fold level was lower than the agreement of each with the automatic classification obtained using average linkage (for SCOP) or single linkage (for CATH) as the clustering algorithm. The networks representing significant evolutionary and structural relationships between clusters beyond the cross-over point may allow us to perform evolutionary, structural, or functional analyses beyond the limits of classification schemes. These networks and the underlying clusters are available at http://ub.cbm.uam.es/research/ProtNet.php
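As an illustration of the clustering step, the following is a minimal sketch of average-linkage agglomerative clustering that stops merging once the best average similarity between two clusters falls below a cutoff, in the spirit of halting at the cross-over point. It assumes a precomputed symmetric similarity matrix; the function name `average_linkage`, the cutoff value, and the toy data are hypothetical, and the naive pair search is meant for clarity, not for a set of 2,890 domains.

```python
def average_linkage(sim, cutoff):
    """Merge clusters by highest average pairwise similarity until it drops below `cutoff`.

    `sim` is a symmetric matrix (list of lists) of pairwise similarities.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best = None  # (average similarity, index of cluster i, index of cluster j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [sim[a][b] for a in clusters[i] for b in clusters[j]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        avg, i, j = best
        if avg < cutoff:
            break  # remaining clusters are too dissimilar to join
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data (hypothetical): two tight pairs that barely resemble each other.
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
clusters_at_cutoff = average_linkage(toy_sim, cutoff=0.5)
```

Single linkage, mentioned above for the comparison with CATH, would differ only in replacing the average over `pairs` with their maximum.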

Created:

2009-03-27
