Machine learning can be as good as maximum likelihood when reconstructing phylogenetic trees and determining the best evolutionary model on four taxon alignments

NIAID Data Ecosystem2026-05-01 收录

下载链接：

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.ksn02v783

下载链接

链接失效反馈

官方服务：

资源简介：

Machine learning can be as good as maximum likelihood when reconstructing phylogenetic topologies and determining the best evolutionary model on four taxon alignments. Phylogenetic tree reconstruction with molecular data is important in many fields of life science research. The gold standard in this discipline is the Maximum Likelihood tree reconstruction method. Here we show that for quartet trees, Machine Learning using neural networks can be as good as the Maximum Likelihood method to infer the best tree topology and the best model of sequence evolution for nucleotide as well as amino acid sequences. For this purpose we simulated data sets for a wide range of branch lengths, evolutionary models and model parameters and compared the topologies and inferred models obtained with Machine learning with those obtained with the Maximum Likelihood and the Neighbour Joining method. Our results show that neural networks are a promising avenue for determining relatedness between taxa, which is likely to accelerate the construction of phylogenetic trees in the future, while maintaining a high accuracy. Methods This archive is part of the DeepNNPhylogeny project: DeepNNPhylogeny, for which the code of the software is available on GitHub. It contains pre-trained neural networks to predict (a) the best models of sequence evolution and (b) the best quartet tree topologies for alignments of four nucleotide or amino acid sequences. For each use case, six neural networks with different architectures have been trained and saved for further usage with the Python library TensorFlow. Neural networks have been saved with the tf.keras.Model.save function in the so-called Tensorflow SavedModel format. All neural networks have been trained with a large number of alignments simulated with the software PolyMoSim v1.1.4, which is available on GitHub. For each simulated data set, model parameters (including proportion of invariant sites, shape parameter of gamma distribution for site heterogeneity, transition/transversion ratio - if applicable, nucleotide base frequencies - if applicable, relative substitution rates - if applicable) and branch lengths have been chosen by a random number generator in specified intervals. While nucleotide alignments have been simulated with the JC (Jukes-Cantor 1969), F81 (Felsenstein 1981), F84 (Felsenstein 1984), K2P (Kimura two parameter), HKY (Hasegawa-Kishino-Yano, 1985) or the GTR (general time reversible) model, amino acid alignments have been simulated with the Dayhoff, JTT, LG or the WAG model. These models are available for model prediction and for topology prediction. For more details on the simulation and training procedure, see the publication (will be available soon).

创建时间：

2023-06-29