Data from: APPLES: Scalable distance-based phylogenetic placement with or without alignments

Mendeley Data. Updated 2024-04-12; indexed 2024-06-27.

Download link:

https://datadryad.org/stash/dataset/doi:10.5061/dryad.78nf7dq

Description:

Assembly-free phylogenetic placement analysis on Lice

This is a set of 61 genome-skims by Boyd et al. (2017), including 45 known lice species (some represented multiple times) and 7 undescribed species. We generate lower coverage skims of 0.1Gb or 0.5Gb by randomly subsampling the reads from the sequence read archives (SRA) provided by the original publication (NCBI BioProject PRJNA296666). We use BBTools (Bushnell, 2014) to filter subsampled reads for adapters and contaminants and to remove duplicated reads. Due to the large size of the reads, we include genome sketches generated by Skmer in this dataset. Since this dataset is not assembled, the coverage of the genome-skims is unknown; Skmer estimates the coverage to be between 0.2X and 1X for 0.1Gb samples (and 5 times that coverage with 0.5Gb). This dataset also includes an ML concatenation tree previously published by Boyd et al. (2017), scripts used in the data preparation, and placement trees output by APPLES.

lice.tar.gz

Simulated gene alignments based on GTR model

This package includes a 101-taxon dataset, previously made available by Mirarab and Warnow (2015). Sequences were simulated under the General Time Reversible (GTR) plus the Γ model of site rate heterogeneity using INDELible (Fletcher and Yang, 2009) on gene trees that were simulated using SimPhy (Mallo et al., 2016) under the coalescent model evolving on species trees generated under the Yule model. We took all 20 replicates of this dataset with mutation rates between 5 × 10^-8 and 2 × 10^-7, and for each replicate, randomly selected five estimated gene trees among those with 20% RF distance between the estimated and true gene trees. Thus, we have a total of 100 backbone trees. The package includes estimated trees, the leave-one-out phylogenetic placement experiment files that made it into the paper, and the scripts used in generating the data and running the experiments.

gtr.tar.bz2

Full RNASim simulation data

Guo et al. (2009) designed a complex model of RNA evolution that does not make the usual i.i.d. assumptions of sequence evolution. Instead, it uses models of energy of the secondary structure to simulate RNA evolution by a mutation-selection population genetics model. This model is based on an inhomogeneous stochastic process without a global substitution matrix. This is an RNASim dataset of one million 227 sequences (with E. coli SSU rRNA used as the root), which consists of a multiple sequence alignment and the true phylogeny.

full-RNAsim-simulation-files.tar.bz2

RNASim Heterogeneous dataset

We first randomly subsampled the full dataset to create 10 datasets of size 10,000. Then, we chose the largest clade of size at most 250 from each replicate; this gives us 10 backbone trees of mean size 249.

RNASim-AE: Estimated alignment dataset

Alignment Error (RNASim-AE) dataset. Mirarab et al. (2015) used PASTA to estimate alignments on subsets of the RNASim dataset with up to 200,000 sequences. This dataset contains their reported alignment with 200,000 or 10,000 sequences (taking only replicate 1 in this case) and the experiment data and scripts for this dataset.

estimated-alignment-data.tar.bz2

RNASim-QS: Query scalability Dataset

We first randomly subsampled the full RNASim dataset to create a dataset of size 500. Then for k = 1 to 49,152 queries (choosing all k = 3 × 2^i, 0 ≤ i ≤ 14) we created 5 replicates of k query sequences, again randomly subsampling from the full alignment with one million sequences. This dataset includes backbone trees, backbone alignments, query alignments, placement trees, time measurements, and the scripts used in this experiment.

query-scalability.tar.bz2

RNASim-VS: Varied Size Dataset

We randomly subsampled the full RNASim dataset to create 5 replicates of datasets of size (n): 500, 1000, 5000, 10000, 50000, and 100000, and 1 replicate (due to size) of size 200000. For replicates that contain at least 5000 species, we removed sites that contain gaps in 95% or more of the sequences in the alignment. This dataset includes backbone alignments, backbone trees, query sequences, and scripts used in performing the experiment.

variable-size.tar.bz2

Varied diameter RNASim dataset

To evaluate the impact of the evolutionary diameter (i.e., the highest distance between any two leaves in the backbone), we also created datasets with low, medium, and high diameters. We sampled the largest five clades of size at most 250 from each of the 10 replicates used for the heterogeneous dataset. Among these 50 clades, we picked the bottom, middle, and top five clades in diameter, which had diameters in [0.3, 0.4] (mean: 0.36), [0.5, 0.52] (mean: 0.51), and [0.65, 1.07] (mean: 0.82), respectively.

varied-diameter.tar.bz2
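
Two of the preparation steps described above are simple enough to illustrate: the query counts used for RNASim-QS (k = 3 × 2^i for 0 ≤ i ≤ 14) and the removal of gap-heavy sites used for RNASim-VS (sites with gaps in 95% or more of the sequences). The following Python is a minimal sketch and not the scripts shipped in the archives. The function names, the dictionary representation of an alignment, and the use of "-" as the only gap character are assumptions made for illustration.

```python
from typing import Dict, List


def query_sizes(max_exponent: int = 14) -> List[int]:
    """Return k = 3 * 2**i for 0 <= i <= max_exponent (hypothetical helper)."""
    return [3 * 2 ** i for i in range(max_exponent + 1)]


def drop_gappy_sites(alignment: Dict[str, str],
                     threshold: float = 0.95) -> Dict[str, str]:
    """Drop alignment columns whose gap fraction is >= threshold.

    `alignment` maps a sequence name to its aligned sequence.
    Only "-" is counted as a gap here (an assumption).
    """
    if not alignment:
        return {}
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    n = len(names)
    # Keep a column only if fewer than `threshold` of the sequences are gapped.
    keep = [
        col for col in range(length)
        if sum(alignment[name][col] == "-" for name in names) / n < threshold
    ]
    return {name: "".join(alignment[name][c] for c in keep) for name in names}


if __name__ == "__main__":
    sizes = query_sizes()
    print(len(sizes), sizes[0], sizes[-1])
    toy = {"a": "A-C", "b": "A-G", "c": "AT-"}
    # A low threshold is used so the toy example actually drops a column.
    print(drop_gappy_sites(toy, threshold=0.5))
```

On a real alignment the sequences would first be read from a FASTA file; that step is left out here because the file layout inside the archives is not described on this page.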

Created:

2023-06-28
