Re-evaluating deep neural networks for phylogeny estimation: the issue of taxon sampling

NIAID Data Ecosystem, indexed 2026-03-12

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.rbnzs7h91

Description:

Deep neural networks (DNNs) are powerful machine learning models that are widely used for classification problems and have recently been proposed for quartet tree phylogeny estimation (Suvorov et al., Systematic Biology 2020, and Zou et al., Molecular Biology and Evolution 2020). Here we present a study evaluating recently trained DNNs (from Zou et al., MBE 2020) in comparison to a collection of standard phylogeny estimation methods, including UPGMA, neighbor joining, maximum parsimony, and maximum likelihood. The evaluation uses a heterogeneous collection of 20-sequence datasets simulated under the same models that were used to train the DNNs, and also under similar conditions but with higher rates of evolution. Our study shows that using DNNs with quartet amalgamation (to combine quartet trees into a tree on the full dataset) is more accurate only than UPGMA, and is less accurate than the other standard phylogeny estimation methods we explore (maximum likelihood, neighbor joining, and maximum parsimony). We further find that while DNNs can provide good quartet tree accuracy, some standard phylogeny estimation methods match or improve on DNNs for quartet accuracy, especially, but not exclusively, when used in a global manner (i.e., the tree on the full dataset is computed and then the induced quartet trees are extracted from the full tree). Thus, our study provides evidence that a major challenge impacting the utility of current DNNs for phylogeny estimation is their restriction to estimating quartet trees, which must subsequently be combined into a tree on the full dataset. In contrast, global methods (i.e., those that estimate trees from the full set of sequences) are able to benefit from taxon sampling, and hence have higher accuracy on large datasets.
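The "global" use of a method described above (estimate a tree on all sequences, then read off the induced quartet trees) can be illustrated with a minimal Python sketch. It is not the study's code: the nested-tuple tree representation and the function names `clades`, `induced_quartet`, and `quartet_accuracy` are hypothetical choices made for illustration, and the trees are assumed to be binary with identical leaf sets.

```python
from itertools import combinations
from math import comb


def clades(tree):
    """Return the leaf set below every internal node of a nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):  # a leaf is a taxon name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def induced_quartet(tree_clades, quartet):
    """Return the induced split {{a, b}, {c, d}}, or None if unresolved."""
    q = frozenset(quartet)
    for clade in tree_clades:
        inside = q & clade
        if len(inside) == 2:  # this edge separates two taxa from the other two
            return frozenset([inside, q - inside])
    return None


def quartet_accuracy(true_tree, est_tree):
    """Fraction of four-taxon subsets on which the two trees induce the same split."""
    true_clades, est_clades = clades(true_tree), clades(est_tree)
    taxa = sorted(max(true_clades, key=len))  # the largest clade holds all taxa
    quartets = list(combinations(taxa, 4))
    agree = sum(
        induced_quartet(true_clades, q) == induced_quartet(est_clades, q)
        for q in quartets
    )
    return agree / len(quartets)


# Toy five-taxon example (hypothetical trees, not data from the study).
true_tree = ((("A", "B"), "C"), ("D", "E"))
est_tree = ((("A", "C"), "B"), ("D", "E"))
print(quartet_accuracy(true_tree, est_tree))

# Number of four-taxon subsets in a 20-sequence dataset.
print(comb(20, 4))
```

A quartet-based method has to estimate each of these four-taxon trees separately and then amalgamate them, whereas a global method estimates a single tree from which every induced quartet follows.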

Created:

2020-08-26
