Scripts and data sets associated with: On testing homogeneity of the evolutionary process using alignments of homologous sequences

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.n8pk0p2xv

Description:

In 2019, Genome Biology and Evolution (11:3341-3352) published three statistical tests for assessing whether alignments of genome sequences violate the phylogenetic assumption of evolution under homogeneous conditions. The new tests extend the matched-pairs tests of symmetry, marginal symmetry, and internal symmetry for alignments of n = 2 homologous sequences of nucleotides or amino acids to cases where alignments of n > 2 sequences are considered. Here we discuss the limitations of these new tests and then outline alternative approaches, which permit formal testing of multiple hypotheses (i.e., by controlling either the family-wise error rate or the false discovery rate). We show that these alternative approaches provide much greater insight into variation of the evolutionary process across lineages, via informal graphical methods and formal statistical procedures. Using one of the procedures (i.e., the Bonferroni test), we show that evolution under heterogeneous conditions is more prevalent than reported in the paper cited above and that the power of the matched-pairs tests of homogeneity is linked to the number of variant sites in an alignment. We release a new version of Homo, a program that allows for formal testing of multiple hypotheses and calculation of adjusted P values. Using Homo, we analysed an alignment of amino acids encoded by 116 flavivirus genomes and reveal that these viral genomes are unlikely to have evolved under homogeneous conditions. To our knowledge, this is the first time that this has been reported for medically important Flavivirus genomes.

Methods

This submission contains batch scripts, sequence data, and protocols describing what was done in five computer-based experiments outlined in the manuscript with the above-mentioned title. The format of the submission was chosen such that it is as FAIR compliant as possible (i.e., that the data are Findable, Accessible, Interoperable, and Reusable).
In other words, the research done in the five experiments is reproducible.

The content in folder Experiment_1 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether the edge lengths of a tree have an impact on the Type I error rate. The file named 00_README describes the method used.

The content in folder Experiment_2 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether the edge lengths of a tree have an impact on the Type II error rate. The file named 00_README describes the method used.

The content in folder Experiment_3 relates to the analysis of the performance of the Maximum Symmetry test. In particular, the experiment was designed to ascertain whether alignment length has an impact on the Type II error rate. The file named 00_README describes the method used.

The content in folder Experiment_4 relates to the analysis of the performance of four tests concerning the global null hypothesis (H_G) of evolution under SRH conditions. In this case, the tests considered are Bonferroni's (1936) test, Hommel's (1983) test, Simes' (1986) test, and Naser-Khdour et al.'s (2019) test. The file named 00_README describes the method used.

The content in folder Experiment_5 relates to the analysis of the performance of four tests concerning the global null hypothesis (H_G) of evolution under SRH conditions. In this case, the tests considered are Bonferroni's (1936) test, Hommel's (1983) test, Simes' (1986) test, and Naser-Khdour et al.'s (2019) test.

As for the genome data used in Experiment_2, we note: The polycistronically encoded amino-acid sequences of 116 flavivirus genomes were retrieved from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and aligned using MAFFT v7.453 (in ginsi mode) (Mol. Biol. Evol., 30:772-780). The completeness of the alignment was surveyed using AliStat v1.13 (NAR Genom. Bioinf., 2:lqaa024), and sites containing ambiguous characters were deleted if the proportion of such characters exceeded 0.2. The resulting alignment of 3,367 sites had a completeness score (Ca) of 0.9880 (for details, see NAR Genom. Bioinf., 2:lqaa024). Model selection was done using ModelFinder (Nat. Methods, 14:587-589), which is implemented in IQ-TREE2 v2.1.3 (Mol. Biol. Evol., 37:1530-1534). We only considered substitution models for viral polypeptides. For each model of sequence evolution considered, tree space was searched under the AIC and BIC optimality criteria. The rtREV+FO+I+R9 model was optimal under the AIC and the rtREV+FO+I+R7 model was optimal under the BIC. Using IQ-TREE2, the same tree was identified under the two models. The UFBoot2 procedure (Mol. Biol. Evol., 35:518-522) was used to assess the consistency of the phylogenetic signal.
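To illustrate how a global null hypothesis (H_G) can be tested from a set of pairwise P values, the following is a minimal Python sketch of two of the four procedures named above, Bonferroni's test and Simes' test, together with Bonferroni-adjusted P values. It is not taken from Homo or from the scripts in this submission: the function names and the example P values are hypothetical, and Hommel's and Naser-Khdour et al.'s tests are not shown. Note also that Simes' test is usually justified under independence or positive dependence of the P values, which may not hold for sequence pairs drawn from the same alignment.

```python
from typing import List, Sequence


def bonferroni_global(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Reject the global null if the smallest P value is <= alpha / m."""
    if not p_values:
        raise ValueError("at least one P value is required")
    m = len(p_values)
    return min(p_values) <= alpha / m


def simes_global(p_values: Sequence[float], alpha: float = 0.05) -> bool:
    """Reject the global null if any ordered P value p_(i) is <= i * alpha / m."""
    if not p_values:
        raise ValueError("at least one P value is required")
    m = len(p_values)
    ordered = sorted(p_values)
    return any(p <= i * alpha / m for i, p in enumerate(ordered, start=1))


def bonferroni_adjusted(p_values: Sequence[float]) -> List[float]:
    """Bonferroni-adjusted P values: each P value times m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical P values from four pairwise tests (illustration only).
example = [0.030, 0.020, 0.040, 0.045]
print(bonferroni_global(example))    # smallest P value vs. 0.05 / 4
print(simes_global(example))         # ordered P values vs. i * 0.05 / 4
print(bonferroni_adjusted(example))
```

With these made-up values, Bonferroni's test does not reject H_G (0.020 exceeds 0.0125), whereas Simes' test does (the largest P value, 0.045, does not exceed 0.05), which shows that the two procedures can disagree on the same input.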

Created:

2024-05-06
