Re-evaluating deep neural networks for phylogeny estimation: the issue of taxon sampling

Name: Re-evaluating deep neural networks for phylogeny estimation: the issue of taxon sampling
Creator: Dryad
Published: 2025-06-01 02:02:59
License: No description available

Source: DataCite Commons; updated 2025-06-01; indexed 2025-06-15

Download link:

https://datadryad.org/dataset/doi:10.5061/dryad.rbnzs7h91

Description:

Deep neural networks (DNNs) are powerful machine learning models that are widely used for classification problems and have recently been proposed for quartet tree phylogeny estimation (Suvorov et al., Systematic Biology 2020; Zou et al., Molecular Biology and Evolution 2020). Here we present a study evaluating recently trained DNNs (from Zou et al., MBE 2020) against a collection of standard phylogeny estimation methods, including UPGMA, neighbor joining, maximum parsimony, and maximum likelihood, on a heterogeneous collection of 20-sequence datasets simulated under the same models that were used to train the DNNs, and also under similar conditions but with higher rates of evolution. Our study shows that DNNs combined with quartet amalgamation (which combines quartet trees into a tree on the full dataset) are more accurate than UPGMA only, and are less accurate than all the other standard phylogeny estimation methods we explore (maximum likelihood, neighbor joining, and maximum parsimony). We further find that while DNNs can provide good quartet tree accuracy, some standard phylogeny estimation methods match or improve on DNNs for quartet accuracy, especially, but not exclusively, when used in a global manner (i.e., the tree on the full dataset is computed and the induced quartet trees are then extracted from it). Thus, our study provides evidence that a major challenge limiting the utility of current DNNs for phylogeny estimation is their restriction to estimating quartet trees, which must subsequently be combined into a tree on the full dataset. In contrast, global methods (i.e., those that estimate trees from the full set of sequences) are able to benefit from taxon sampling, and hence have higher accuracy on large datasets.
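To illustrate what "extracting induced quartet trees from the full tree" means, here is a minimal Python sketch. It is not code from the dataset or the study. The nested-tuple tree encoding and the function names `clades` and `induced_quartets` are assumptions made for this example only. For every set of four taxa, the sketch looks for an edge of the full tree that separates two of them from the other two, which is the quartet topology the tree induces.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf_set, clade_list) for a tree given as nested tuples of leaf names.

    Each entry of clade_list is the leaf set below one edge of the tree
    (hypothetical encoding chosen for this illustration).
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
        found.append(child_leaves)  # the edge leading to this child
    return leaves, found


def induced_quartets(tree):
    """Map each sorted 4-taxon tuple to its induced split {{a, b}, {c, d}}.

    Quartets that the tree leaves unresolved (possible only when the tree
    has multifurcations) are omitted from the result.
    """
    leaves, edge_sides = clades(tree)
    result = {}
    for quartet in combinations(sorted(leaves), 4):
        q = frozenset(quartet)
        for side in edge_sides:
            pair = side & q
            if len(pair) == 2:  # this edge separates two taxa from the other two
                result[quartet] = frozenset([pair, q - pair])
                break
    return result


# Usage: a five-taxon example tree, ((A,B),C) joined to (D,E).
example = ((("A", "B"), "C"), ("D", "E"))
for taxa, split in induced_quartets(example).items():
    print(taxa, sorted(sorted(side) for side in split))
```

A global method in the sense of the description would estimate the full tree first and then read off quartets this way, whereas a quartet-based DNN estimates each four-taxon tree separately and needs an amalgamation step to build the full tree.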

Provider:

Dryad

Created:

2020-08-27
