Protein tertiary structure is no better than poorly modelled amino acid sequences at resolving bilaterian relationships
Source: Figshare. Updated: 2025-11-27. Indexed: 2026-04-28.
Download link:
https://figshare.com/articles/dataset/Protein_tertiary_structure_is_no_better_than_poorly_modelled_amino_acid_sequences_at_resolving_bilaterian_relationships/30519230
Resource description:
The exponential increase in the availability of protein tertiary structures has re-ignited interest in their suitability for phylogenetic inference. Claims that this data type vastly outperforms primary sequences in resolving deep-time relationships have not held outside of paralog-rich protein family tree inference. Subsequent work on a standard, though taxon-poor, single-copy ortholog data set found that sequence-only data recovered considerably more uncontested metazoan relationships than structure-only data. Here, we extended these experiments to a taxon-rich data set and compared sequence-only, structure-only and combined data sets under model-based maximum likelihood tree inference, using supermatrix and supertree approaches. We also compared the performance of model-based and distance-based structure-only tree inference. We found that, even with minimal mitigation of sources of error, species trees inferred from sequence-only data were more similar to the canonical metazoan tree than all structure-aware trees, with conflicts between the canonical and inferred sequence trees explained by well-known cases of systematic error due to long-branch attraction (LBA). Within the structure-only analyses, which were just as prone to LBA, we found that neighbour-joining trees inferred from the 3Di-based Fident distance matrices outperformed the model-based analyses of 3Di “sequences”. Thus, while the approach is promising and exciting, we do not yet have the methodological tool set that will enable us to routinely use tertiary protein structures in the context of single-copy ortholog phylogenomics.
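The description compares inferred species trees by how similar they are to a canonical tree, but this listing does not say which similarity measure was used. As an illustration only, the sketch below assumes the Robinson-Foulds distance (the number of bipartitions found in one tree but not the other), which is one common choice. The nested-tuple tree encoding, the function names and the five-taxon toy trees are all hypothetical and are not taken from the data set.

```python
# Minimal sketch, assuming Robinson-Foulds distance as the tree-similarity
# measure (an assumption; the source does not name its metric).
# Trees are nested tuples of leaf-name strings (hypothetical encoding).

def leaf_set(tree):
    """Return the frozenset of leaf names found under a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits of the tree, treated as unrooted."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)  # fixes which side of each split is stored
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaf_set(node)
        # Keep only splits with at least two leaves on each side.
        if 1 < len(clade) < len(all_leaves) - 1:
            # Store the side that lacks the reference leaf, so the same
            # split is always represented by the same frozenset.
            side = all_leaves - clade if reference in clade else clade
            splits.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Count splits present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical five-taxon toy trees; the names are placeholders.
canonical = (("A", "B"), "C", ("D", "E"))
inferred = (("A", "C"), "B", ("D", "E"))
rf = robinson_foulds(canonical, inferred)
```

A lower value means the inferred tree shares more splits with the reference tree. Both trees must contain the same set of taxa for the comparison to be meaningful.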
Created:
2025-11-27



