
Data from: ASTRAL: genome-scale coalescent-based species tree estimation

Mendeley Data; updated 2024-04-13; indexed 2024-06-27
Download link:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.ht76hdrp0
Resource description:
# ASTRAL: genome-scale coalescent-based species tree estimation

This repository includes both simulated and biological datasets.

## Description of the data and file structure

The following datasets are used in the ASTRAL paper named above. All of these archive files include README files that describe their content.

### biological.zip

This file includes:

1. our estimated gene trees on alignments provided to us by the authors of Song et al., 2012, PNAS,
2. our estimated species trees on the same dataset.

We re-analyzed two biological datasets in our paper.

#### Song et al. dataset

We obtained gene alignments from Song et al. and re-estimated gene trees and species trees. The following files are included in mammals.zip:

* mammals-alignments.zip contains all the alignments that we obtained from Song et al.
* mammals-genetreess.zip contains the gene trees that we estimated. For each gene, we include 3 files:
  * RAxML_bipartitions.final.f200 is the best ML tree with support values drawn on it based on 200 bootstrap replicates.
  * RAxML_bootstrap.all includes 200 replicates of bootstrapping using RAxML.
  * RAxML_bootstrap.all.extra is related to the gene resampling procedure. When gene resampling bootstrapping was used, some genes needed more than 200 bootstrap replicates. Those are included in the RAxML_bootstrap.all.extra files (thus the first 200 replicates are the same as in RAxML_bootstrap.all, but some genes have more replicates).
* `424.[mpest/astral].mlbs`: the species trees estimated based on these 424 gene trees. Note that the original Song et al. dataset has 447 genes, but we removed 23 genes for reasons described in the paper.

#### Chiari et al. dataset

All the gene data related to this dataset are already available on [Dryad](http://datadryad.org/resource/doi/dryad.87b01fq0).

### truetrees.zip

The model species tree and the true gene trees simulated based on the mammalian dataset of Song et al., 2012, PNAS. The following files are available (all in newick format):

* model-species-tree: The model species tree used for simulation.
* The following are the true gene trees simulated using the coalescence process based on the model tree. Branch lengths in the model species tree are multiplied by 2 or 5, or divided by 2 or 5, to create alternative levels of incomplete lineage sorting (ILS).
  * true-trees-1X
  * true-trees-scaled2down
  * true-trees-scaled2up
  * true-trees-scaled5down
  * true-trees-scaled5up

### sequencedata.zip

Sequence data simulated on the true gene trees (mammalian dataset). All the simulated alignments for the various levels of ILS are given here. Each zip file is a collection of alignments (`.fasta`), which make up the content of the replicates. Only full alignments are given here. Alignments are trimmed to their first 500 bp or 1000 bp to create model conditions with varying phylogenetic signal.

### estimatedgenetrees.zip

This file gives gene trees estimated using RAxML on alignments of length 1000 and 500 (mammalian dataset).

* All gene trees, including their bootstrap replicates, are provided in this file. The zip file contains a set of other zip files, each corresponding to a different ILS level. Each of these zip files consists of a collection of files with the following format:
  * `[gene id]`.`[alignment length]`-BestML.tre
  * `[gene id]`.`[alignment length]`-bp.MLBS.gz

  `BestML` is a newick tree file that contains the maximum likelihood tree returned by RAxML (best of 10 runs). `MLBS.gz` includes the set of 200 bootstrap replicates for each gene (note that these files are compressed using gzip and need to be uncompressed using gunzip).
* `gene_ids.txt`: Note that model conditions in the paper are defined by the number of genes in addition to the ILS level and the alignment length. The gene trees provided here are the same for model conditions that differ only in the number of genes. Thus, a particular gene id can be used in model conditions with 25 genes, 50 genes, and so on. The gene ids assigned to each model condition are shown in `gene_ids.txt`. For example, gene id 2342 is used in replicate 12 of the 200-gene model conditions, replicate 6 of the 400-gene model conditions, and replicate 3 of the 800-gene model condition.

## Sharing/Access information

We acknowledge the help of [Bastien Boussau](https://sites.google.com/site/bastienboussau/), who performed these simulations for another study and made them available to us for this paper.

## Code/Software

We used the script given here in the mammalian dataset:
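
The sketch below is not the authors' script. It is a minimal, hedged illustration of two steps the description implies for anyone reusing the files: trimming a full alignment from sequencedata.zip to its first 500 or 1000 columns, and reading bootstrap replicates from a gzip-compressed `MLBS.gz` file. All function names are hypothetical, and the sketch assumes plain FASTA input and one newick tree per line in the bootstrap files; neither assumption is confirmed by the description above.

```python
import gzip


def read_fasta(path):
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; store the previous one, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def trim_alignment(records, length):
    """Keep only the first `length` columns of every aligned sequence."""
    return [(header, seq[:length]) for header, seq in records]


def write_fasta(records, path):
    """Write (header, sequence) pairs as FASTA, one sequence per line."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n{seq}\n")


def read_bootstrap_trees(path):
    """Read newick strings from a gzip file (assumes one tree per line)."""
    with gzip.open(path, "rt") as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, `write_fasta(trim_alignment(read_fasta(full_path), 500), trimmed_path)` would produce a 500 bp version of one alignment, where `full_path` and `trimmed_path` are placeholders for paths inside the unpacked archives.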
Created:
2024-01-08