Data from: Supertree-like methods for genome-scale species tree estimation
Source: Mendeley Data. Updated: 2024-03-27. Indexed: 2024-06-29.
Download link:
https://databank.illinois.edu/datasets/IDB-4004605
Description:
This repository includes previously unpublished scripts and datasets for Chapter 6 of my PhD dissertation, "Supertree-like methods for genome-scale species tree estimation." This chapter is based on the article: Molloy, E.K. and Warnow, T. "FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models." Bioinformatics, In press. https://doi.org/10.1093/bioinformatics/btaa444. The results presented in my PhD dissertation differ from those in the Bioinformatics article, because I re-estimated species trees using FastMulRFS and MulRF on the same datasets in the original repository (https://doi.org/10.13012/B2IDB-5721322_V1). The re-estimation differed from the original analysis in two ways: (1) a seed was specified when running MulRF, and (2) a different script (specifically preprocess_multrees_v3.py from https://github.com/ekmolloy/fastmulrfs/releases/tag/v1.2.0) was used for preprocessing gene trees (which were then given as input to MulRF and FastMulRFS). Note that this preprocessing script is a re-implementation of the original algorithm for improved speed; it also includes a bug fix. Finally, it was brought to my attention that the simulation in the Bioinformatics article differs from prior studies, because I scaled the species tree by 10 generations per year (instead of 0.9 years per generation, which is ~1.1 generations per year). I re-simulated datasets (true-trees-with-one-gen-per-year-psize-10000000.tar.gz and true-trees-with-one-gen-per-year-psize-50000000.tar.gz) using 0.9 years per generation to quantify the impact of this parameter change (see my PhD dissertation or the supplementary materials of the Bioinformatics article for discussion).
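To make the size of the scaling difference concrete, the arithmetic behind the two settings can be sketched as follows. This is only an illustration of the unit conversion stated in the description, not code from the repository; the function and variable names are hypothetical, and the ratio assumes that branch lengths in years are converted to generations by multiplying by the number of generations per year.

```python
def generations_per_year(years_per_generation: float) -> float:
    """Convert a generation time (years per generation) to a rate (generations per year)."""
    if years_per_generation <= 0:
        raise ValueError("years_per_generation must be positive")
    return 1.0 / years_per_generation


# Setting attributed to prior studies: 0.9 years per generation.
prior_rate = generations_per_year(0.9)

# Setting used in the Bioinformatics article: 10 generations per year.
article_rate = 10.0

# Assumption (not from the source): branch length in generations is
# branch length in years times generations per year, so this ratio is
# the factor by which the two scalings differ.
ratio = article_rate / prior_rate

print(f"{prior_rate:.2f} generations per year")
print(f"{ratio:.1f}x difference between the two scalings")
```

Under that assumption, the article's setting yields branch lengths roughly nine times longer in generations than the 0.9 years per generation setting.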
Created:
2023-06-28



