
Data sets for "ARTreeFormer: A Faster Attention-based Autoregressive Model For Phylogenetic Inference"

Source: DataCite Commons. Updated: 2025-10-03. Indexed: 2026-02-09.
Download link:
https://figshare.com/articles/dataset/Data_sets_for_ARTreeFormer_A_Faster_Attention-based_Autoregressive_Model_For_Phylogenetic_Inference_/30272299/1
Resource description:
Description of the data

ARTreeFormer is a deep autoregressive model for the distribution of phylogenetic tree topologies. It significantly improves upon ARTree in training and sampling speed by leveraging a fixed-point numerical solver and an attention scheme. The tree topology density estimation (TDE) experiment is widely used to assess the estimation accuracy and speed of probabilistic models for phylogenetic tree topologies. To perform this experiment, one first trains a model by fitting it to training data of tree topologies with maximum likelihood estimation, and then evaluates its fit (e.g., KL divergence) to the ground truth tree topologies. This repository contains eight sets of training data of tree topologies (i.e., "short run") and ground truth data of tree topologies (i.e., "golden run"), called DS1-8, for reproducing the TDE experiment in the ARTreeFormer paper.

These tree topologies are constructed by running MrBayes on the following eight sequence data sets, which consist of sequences from 27 to 64 eukaryote species with 378 to 2520 site observations.

DS1, 27 taxa, 1949 sites, [Hedges et al. (1990)]
DS2, 29 taxa, 2520 sites, [Garey et al. (2012)]
DS3, 36 taxa, 1812 sites, [Yang and Yoder (2003)]
DS4, 41 taxa, 1137 sites, [Henk et al. (2003)]
DS5, 50 taxa, 378 sites, [Lakner et al. (2008)]
DS6, 50 taxa, 1133 sites, [Zhang and Blackwell (2001)]
DS7, 59 taxa, 1824 sites, [Yoder and Yang (2004)]
DS8, 64 taxa, 1008 sites, [Rossman et al. (2001)]

Files

File: short_run_data_DS1-8
Description: This file contains 8 sets of tree topologies as training data, and each set contains 10 replicates. To construct the training data set, we run MrBayes v3.2.7a on each sequence set with 10 replicates of 4 chains and 8 runs until the runs have ASDSF (the standard convergence criterion used in MrBayes) less than 0.01 or reach a maximum of 100 million iterations, collect the samples every 100 iterations, and discard the first 25%. For the Bayesian setting in MrBayes runs, we assume a uniform prior on the tree topologies, an i.i.d. exponential prior Exp(10) on branch lengths, and the simple Jukes & Cantor (JC) substitution model.

File: golden_run_data_DS1-8
Description: This file contains 8 sets of tree topologies as ground truth data. For each sequence set, we run 10 extremely long single-chain MrBayes (v3.2.7a) runs, each for one billion iterations, where the samples are collected every 1000 iterations, with the first 25% discarded as burn-in. For the Bayesian setting in MrBayes runs, we assume a uniform prior on the tree topologies, an i.i.d. exponential prior Exp(10) on branch lengths, and the simple Jukes & Cantor (JC) substitution model.

Software

We use MrBayes v3.2.7a to produce the training data and the ground truth data of tree topologies.
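The evaluation step of the TDE experiment (comparing a fitted model against the golden-run topologies via KL divergence) can be sketched as follows. This is a minimal illustration, not the code used in the ARTreeFormer paper. It assumes each sampled topology has already been reduced to a canonical string key, so that identical topologies compare equal. It also assumes the divergence is taken from the empirical ground-truth distribution to the model. The names `empirical_distribution`, `kl_divergence`, and `log_q` are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, Iterable


def empirical_distribution(topologies: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each topology key in a sample (e.g. a golden run).

    Assumption: each key is a canonical representation of an unrooted
    topology, so equal topologies map to equal strings.
    """
    counts = Counter(topologies)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no topologies given")
    return {key: count / total for key, count in counts.items()}


def kl_divergence(p: Dict[str, float], log_q: Callable[[str], float]) -> float:
    """KL(p || q) = sum_t p(t) * (log p(t) - log q(t)) over the support of p.

    `log_q` is a hypothetical callable returning the model's log-probability
    of a topology key; any trained density estimator could stand in for it.
    """
    return sum(
        prob * (math.log(prob) - log_q(key))
        for key, prob in p.items()
        if prob > 0.0
    )


# Toy usage: three distinct topologies and a model that is uniform over them.
golden_sample = ["t1", "t1", "t2", "t3"]
p_truth = empirical_distribution(golden_sample)
kl_uniform = kl_divergence(p_truth, lambda key: math.log(1.0 / 3.0))
kl_self = kl_divergence(p_truth, lambda key: math.log(p_truth[key]))
```

In the toy usage, `kl_self` is zero because the model reproduces the empirical frequencies exactly, while `kl_uniform` is positive because the uniform model underweights the most frequent topology.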
Provider:
figshare
Created:
2025-10-03