
The Dayhoff Exchange Score: A new metric to quantify site saturation in amino acid datasets prior to phylogenetic analysis

NIAID Data Ecosystem, indexed 2026-05-10
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.34tmpg4tm
Description:
Entropic site saturation is a persistent problem in phylogenetic analyses, where it can hinder the accuracy of topology reconstruction. It is fundamentally caused by large amounts of independent change along branches, which leave the model unable to distinguish phylogenetic signal from noise. The Dayhoff Exchange Score (DE-Score) is a new metric to assess this form of site saturation within and between amino acid datasets. It provides both a whole-dataset overview and taxon-specific values that represent the contribution of a given taxon to the entropic site saturation of the whole dataset. We first assess the efficacy of this score at detecting increased entropic site saturation over 20,000 simulation datasets, compare it to the existing Slope R2 score, and then assess its efficacy in the face of the potentially confounding factors of increasing taxon number, number of positions in the alignment, missing data, and noise. Finally, we use the DE-Score to re-evaluate several previously published datasets to illustrate its efficacy.

Methods

The methods and their implications are explored in greater detail in the PDF file inside folder 4_Other.

1_Kocot2017: Applying the DE-Score to real data: Reselection Datasets

The work of Kocot et al. (2017) (Kocot, K.M., Struck, T.H., et al. 2017) was chosen for more thorough examination because the Slope score (Nosenko, T., Schreiber, F., et al. 2013) was previously used to assess site saturation within genes in this dataset, which allowed us to assess the coherence between the two metrics and the effect of their differences.

For this reselection study, a phylogenetic tree was recovered using IQTree-mpi (v1.6.12) (Nguyen, L.-T., Schmidt, H.A., et al.
2015) and the LG+F model, as in the original analysis, but instead using the sextile of least saturated genes, as selected by the DE-Score, and then a concatenated dataset comprising the 145 genes that were placed in the highest five out of six sextiles (523 genes) of the DE-Score, nRCFV, occupancy and LB-Score at once. These topologies were then compared to those from the sextiles selected by the Slope score and to the highest-five-out-of-six tree in the initial study (Kocot, K.M., Struck, T.H., et al. 2017).

The command used to generate the topologies in IQTree was:

    iqtree-mpi -s -sp -m LG+F

2_MissingData_Simulations: Assessing the Effects of Missing Data

To assess the effect of missing data on the Dayhoff Category Exchange Ratio (DaCER) and the DE-Score, we used the 1,200 simulation datasets that were used to assess the effect of missing data on nRCFV in Fleming and Struck (2023) (Fleming, J.F. and Struck, T.H. 2023). These simulation datasets used the six "Missing Data" categories of Kocot et al. (2017) (Kocot, K.M., Struck, T.H., et al. 2017), which divide the total dataset assessed in that study into sextiles based on increasing percentage of missing data, ranging from 18.17% to 38.43%. As per Kocot et al.'s methodology, missing data was classified as the presence of ambiguity characters, gaps, or a lack of sampling (or absence) of the target gene in the taxon.

Simulation datasets were created using the alignment mimic command in IQTree2's AliSim, which replicates the conditions of the source dataset, including missing data. This resulted in 600 simulation datasets including missing data, which we named "gapped" datasets. Simulated replicates of the same Kocot et al. datasets were then generated using AliSim's --no-copy-gaps option, which generates gapless simulation datasets that otherwise replicate the source alignment.
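As a minimal sketch of the missing-data definition used for 2_MissingData_Simulations (ambiguity characters, gaps, and unsampled genes all count as missing), the percentage for a single alignment could be computed roughly as follows. The function name, the `MISSING_CHARS` set and the dictionary input format are illustrative assumptions and are not taken from the deposited scripts.

```python
# Characters treated as missing data. This set is an assumption made for
# illustration; the original study may use a different set of symbols.
MISSING_CHARS = frozenset("-?X")


def missing_data_percentage(alignment, all_taxa=None):
    """Return the percentage of missing cells in an amino acid alignment.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    all_taxa:  optional iterable of every taxon expected in the dataset;
               taxa absent from `alignment` are counted as entirely missing.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("alignment must be non-empty with equal-length sequences")
    n_sites = lengths.pop()

    taxa = set(alignment) if all_taxa is None else set(all_taxa) | set(alignment)
    total_cells = len(taxa) * n_sites
    if total_cells == 0:
        raise ValueError("alignment has no cells")

    # Gaps and ambiguity characters in the sampled taxa.
    missing = sum(
        sum(1 for residue in seq.upper() if residue in MISSING_CHARS)
        for seq in alignment.values()
    )
    # Each unsampled taxon contributes a full row of missing cells.
    missing += (len(taxa) - len(alignment)) * n_sites
    return 100.0 * missing / total_cells
```

Passing `all_taxa` lets a gene alignment that lacks some taxa be scored against the full taxon set, which mirrors the "lack of sampling" part of the definition.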
The following commands were used in IQTree 2.2.0:

    For "Gapped" datasets: iqtree2 --alisim <Output> -s <Missing Data Dataset>
    For "Ungapped" datasets: iqtree2 --alisim <Output> -s <Missing Data Dataset> --no-copy-gaps

3_NoiseSimulations: Assessing the Effects of Noise

To assess the effect of noise on the Dayhoff Category Exchange Ratio and the DE-Score, we first selected one category of the simulation datasets used to initially assess changes in the number of taxa and positions: 100 simulation datasets with 250 taxa and 2100 positions. This category was chosen as it represented a medium-sized dataset among our simulation datasets. New simulation datasets were generated by appending noise to the 100 simulated datasets so that noise made up 10%, 20%, 30%, 40% and 50% of the final alignment, resulting in datasets that were 2333, 2625, 3000, 3500 and 4200 amino acids long, respectively. Noise was generated by randomly generating a number between 1 and 20 for each site within each taxon in the alignment. Each number was assigned an amino acid prior to the generation of the random number string, and the randomly generated numbers were then transformed into amino acid noise. The script used to do this, Noisemaker, can be found inside folder 4_Other, and at the DE-Score Calculator GitHub repository:

    https://github.com/JFFleming/DEScore

5_PositionAndTaxaIncrease_DEScore: Assessing the Effects of an Increasing Number of Taxa and Positions on the Dayhoff Category Exchange Ratio

To assess the effect of changes in the number of taxa and positions in an alignment on the Dayhoff Category Exchange Ratio and the DE-Score, we generated 10,000 simulation datasets under the WAG+F+G model on a balanced tree, using the alisim function of IQTree version 2.3.6 (Nguyen, L.-T., Schmidt, H.A., et al. 2015, Ly-Trong, N., Naser-Khdour, S., et al.
2022), under the following command:

    iqtree2 --alisim $SimulationPrefix --seed 2014 -m WAG+F+G -t RANDOM{bal/$taxaNumber} --length $alignmentLength --num-alignments 100 --out-format fasta -redo

WAG+F+G was used to create simulations that contained more compositional variability than the GTR model, and thereby more realistic variation in site saturation between simulation datasets. Simulation datasets were created in bins of 100 datasets each: taxon number ranged from 50 to 500 at intervals of 50, and alignment length from 300 to 3000 positions at intervals of 300, giving 10 x 10 bins and a total of 10,000 simulation datasets.

6_SaturationSims: Establishing the ability of the DE-Score to detect Site Saturation and comparing the DE-Score to the slope of the regression line of patristic vs. p-distances

To assess the utility of the DE-Score for measuring site saturation within a dataset, we used simulation datasets previously used to assess site saturation by Hernandez & Ryan (2021) (Hernandez, A.M. and Ryan, J.F. 2021). These datasets, originally based on the Chang et al. (2015) tree (Chang, E.S., Neuhof, M., et al. 2015), were generated by applying a scaling factor, from 1 to 20, to all branches of the original tree. The datasets were generated under two models, the Dayhoff model and the JTT model, using Seq-Gen, as described in Hernandez & Ryan (2021) (Hernandez, A.M. and Ryan, J.F. 2021). Each scaling factor category comprises 1,000 datasets, resulting in a total of 20,000 datasets for each model and 40,000 datasets overall.

To assess the efficacy of our new metric in comparison to existing metrics, we calculated the slope of the regression line of patristic vs. p-distances as described in Nosenko et al. (2013) and implemented in TreSpEx (Nosenko, T., Schreiber, F., et al. 2013, Struck, T.H. 2014).
As the Slope R2 score has not yet been assessed on simulation datasets, we assessed its efficacy against the 20,000 simulation datasets generated under the Dayhoff model by Hernandez & Ryan (2021) (Hernandez, A.M. and Ryan, J.F. 2021) that were initially used to assess the efficacy of the DaCER.

We generated two variations of the Slope R2 score: one by comparing the patristic distances of the true tree to the p-distances, and one by comparing the patristic distances of the trees generated from the Dayhoff model simulation datasets, analysed under the simpler JTT model, to the p-distances. The second variation was intended to model the effect of phylogenetic topological artifacts caused by site saturation on the Slope R2 score.

We then used Spearman's rank correlation coefficient to assess each pair of conditions, by ranking the Slope R2 or DE-Score of each dataset and comparing these ranks to assess correlation between the two metrics.

7_RealDataStudies: Applying the DE-Score to real data: Overview Datasets

To better understand the relationship between the DaCER and the DE-Score on real data, and to better explore its efficacy and use cases, we selected 6 previously published single-protein phylogenetic datasets across 4 previously published papers (Vieira, F.G. and Rozas, J. 2011, Fleming, J.F., Pisani, D., et al. 2021, Novotná Floriančičová, K., Baltzis, A., et al. 2023, Giacomelli, M., Vecchi, M., et al. 2025), and 21 previously published multi-protein phylogenetic datasets from 21 previously published papers (Misof et al. 2014, Fernández et al. 2016, Fernández et al. 2017, Irisarri et al. 2017, Kocot et al. 2017, Peters et al. 2017, Fernández et al. 2018, Hughes et al. 2018, Johnson et al. 2018, Sharma et al. 2018, Shen, Jin et al. 2018, Shen, Opulente et al. 2018, Benavides et al. 2019, Evangelista et al. 2019, Kawahara et al. 2019, Simon et al. 2019, Steenwyk et al. 2019, Milla et al. 2020, Mongiardo, Koch & Thompson 2021, Wibberg et al. 2021, Herranz et al. 2022).
8_ComparisonWithSatuRation: Comparing the DE-Score to the Mean Lambda Entropy

Using the branch scaling factor datasets previously used to compare the DE-Score to the Slope measurement (6_SaturationSims), we further examined the effectiveness of the DE-Score in comparison to another metric of site saturation in amino acid datasets: mean historical signal strength (λ).
Created:
2025-12-22