
Data from "Removing recombinant loci has minimal impact on species tree topologies estimated from empirical data"

Source: Mendeley Data. Updated 2024-06-27; indexed 2024-06-27.
Download link:
https://figshare.com/articles/dataset/Data_from_Removing_recombinant_loci_has_minimal_impact_on_species_tree_topologies_estimated_from_empirical_data_/26087437/1
Description:
Figshare repository documentation. Caitlin Cherryh, 2024.

An underlying assumption in phylogenetics is that each site in a locus shares an identical evolutionary history that fits a single bifurcating tree. However, this assumption is broken by biological processes such as introgression or recombination. We selected four empirical datasets and investigated whether removing loci identified as putatively recombinant impacted species tree topology. To do so, we selected three tests for recombination detection (PHI, MaxChi, and GeneConv) and applied each test to each locus in each dataset. We then used the results to break the loci into subsets: for each test, the set of loci was split into a subset of loci that passed that test and a subset of loci that failed that test (i.e. loci that were identified as putatively recombinant). We then estimated species trees from each subset with both summary coalescent (ASTRAL-III) and maximum likelihood (IQ-Tree2) tree estimation methods. Finally, we compared the goodness of fit and topology of each tree.

### Replicating our analyses

The caitlinch/gene_filtering GitHub repository contains all R scripts necessary to repeat these analyses: https://github.com/caitlinch/gene_filtering. See the manuscript for detailed methods.

### Software programs

- Trees were estimated in IQ-Tree2 (http://www.iqtree.org/), ASTRAL (https://github.com/smirarab/ASTRAL), and RAxML-ng (https://github.com/amkozlov/raxml-ng).
- The recombination tests applied are available in the programs PHIPack (https://www.maths.otago.ac.nz/~dbryant/software.html) and GeneConv (https://www.math.wustl.edu/~sawyer/geneconv/).
- Tree adequacy tests were performed using the AU test (implemented in IQ-Tree2) and the QuartetNetwork Goodness of Fit test (https://github.com/cecileane/QuartetNetworkGoodnessFit.jl).
- Further details on software programs are available in the manuscript or the GitHub repository for this project (https://github.com/caitlinch/gene_filtering).

### Data

1. `empirical_datasets.pdf`: Documentation of the 4 empirical alignments analysed in this study, including the original manuscript and a record of where each matrix was obtained.
2. `datasets/`: One directory per dataset, containing the loci alignments used in our analysis.
   - `1KP/1KP_alignments-FAA-masked_genes_renamed.zip`: Loci alignments used for this analysis.
   - `1KP/1KP_annotations.csv`: CSV file from Leebens-Mack et al. (2019), outlining clades and classification for each taxon.
   - `Pease2016/Pease2016_all_window_alignments`: Loci alignments used for this analysis. Generation of window alignments is described in the methods of the manuscript.
   - `Vanderpool2020/Vanderpool2020_1730_Alignments_FINAL.zip`: Loci alignments used for this analysis.
   - `Whelan2017_genes.zip`: Loci alignments used for this analysis.
3. `trees/`: All maximum likelihood (estimated in IQ-Tree) and summary (estimated in ASTRAL) trees from our analysis.
4. `qcf/`: All quartet concordance factor results. One directory per dataset.
5. `files/`:
   - `00_1KP_loci_models_noFreeRates.csv`: Model estimation results used when estimating maximum likelihood trees from the 1KP dataset. Details in manuscript.
   - `01_AllDatasets_IQ-Tree_warnings_LociToExclude.csv`: List of loci to exclude from tree estimation, based on errors raised in IQ-Tree.
   - `01_AllDatasets_RecombinationDetection_complete_collated_results.csv`: Results from applying the recombination tests to each gene.
   - `02_AllDatasets_RecombinationDetection_PassFail_record.csv`: Record of whether individual loci passed or failed the three tests for recombination.
   - `02_species_tree_summary_numbers.csv`: Summary of the species tree estimation process. Lists the number of loci in each dataset that passed or failed each test for recombination.
   - `03_AllDatasets_collated_ComparisonTrees_AU_test_results.csv`: AU test results for maximum likelihood trees.
   - `03_AllDatasets_collated_ComparisonTrees_QuarNetGoF_test_results.csv`: Quartet Goodness of Fit test results for summary trees.
   - `03_AllDatasets_collated_RF_wRF_distances_results.csv`: RF and wRF distances between trees.
   - `04_BranchSupport_values.csv`: Branch support values (ultrafast bootstrap or local posterior probability).
   - `04_qCF_values.csv`: Quartet concordance factor results.
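
### Example: splitting loci by test result

The pass/fail partition described in the summary can be sketched in Python. This is a minimal illustration only and is not the authors' R code. The column names (`dataset`, `locus`, `PHI`, `MaxChi`, `GeneConv`), the `pass`/`fail` cell values, and the example rows are all assumptions made for this sketch; the actual layout of `02_AllDatasets_RecombinationDetection_PassFail_record.csv` may differ, so check the file header before adapting it.

```python
import csv
import io

# Hypothetical stand-in for a pass/fail record. Column names and values are
# assumptions for illustration, not the real layout of the repository CSV.
EXAMPLE_CSV = """dataset,locus,PHI,MaxChi,GeneConv
Example,locus_001,pass,pass,pass
Example,locus_002,fail,pass,pass
Example,locus_003,pass,fail,fail
Example,locus_004,fail,fail,pass
"""

TESTS = ("PHI", "MaxChi", "GeneConv")


def split_loci_by_test(rows, tests=TESTS):
    """Return {test: {"pass": [loci], "fail": [loci]}}.

    Each test partitions the full set of loci independently, so a locus
    can fall in the "pass" subset for one test and "fail" for another.
    """
    subsets = {test: {"pass": [], "fail": []} for test in tests}
    for row in rows:
        for test in tests:
            outcome = row[test].strip().lower()
            if outcome not in ("pass", "fail"):
                raise ValueError(
                    f"Unexpected value {row[test]!r} for {test} in {row['locus']}"
                )
            subsets[test][outcome].append(row["locus"])
    return subsets


# Parse the in-memory example; for a real file, pass an open file handle
# to csv.DictReader instead of io.StringIO.
rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
subsets = split_loci_by_test(rows)

for test in TESTS:
    print(test, "pass:", subsets[test]["pass"], "fail:", subsets[test]["fail"])
```

Each of the resulting six subsets (pass and fail for each of the three tests) would then be handed to a species tree estimation step, which is outside the scope of this sketch.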

Created:
2024-06-26