Estimating accurate gene trees in the presence of intra-locus recombination: A simulation study
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
http://datadryad.org/dataset/doi%253A10.5061%252Fdryad.6q573n626
Description:
Accurate gene trees are difficult to estimate with traditional methods because of the effects of recombination. New methods that co-estimate gene trees and recombination breakpoints work differently from the traditional maximum likelihood (ML) framework and therefore have the potential to alleviate inaccuracies caused by recombination. However, the accuracy of the gene trees produced by these methods has yet to be evaluated under a broad range of conditions. Using simulations, we studied gene tree accuracy in the presence of intra-locus recombination. Using a previously published model of human population history, we simulated the process of recombination along large sections of a genome to produce DNA sequence alignments. We varied three parameters that influence gene tree accuracy: recombination rate, population size, and substitution rate. We then compared the accuracy of gene trees estimated with different methodologies, including traditional maximum likelihood estimation of single and concatenated regions, as well as more sophisticated co-estimation methods. Unsurprisingly, we found that traditional approaches produce accurate gene trees only in narrow regions of parameter space: as the number of sites used to estimate a gene tree increases, recombination becomes increasingly problematic. Some, but not all, of the co-estimation methods successfully circumvent this tradeoff and have the potential to produce accurate gene trees in broader regions of parameter space. These results indicate that, by adopting co-estimation methods, systematists may be able to improve gene tree accuracy.
Methods
A brief overview is given here; the methods section in the paper goes into more detail.
This is a simulation study of genetic data. Phylogenetic trees were simulated with the program Msprime for regions of a given size, recombination rate, and population size. Next, DNA sequence data were simulated on these trees with the program Seq-gen, using 11 different substitution rates.
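The layout of such a simulation grid can be sketched as follows. This is a minimal illustration only: every numeric value below is a hypothetical placeholder (the actual region sizes, recombination rates, population sizes, and substitution rates are given in the paper), and the name `build_grid` is introduced here for the example. The only detail taken from the description above is that each simulated tree condition is paired with 11 substitution rates.

```python
from itertools import product

# Hypothetical placeholder values, NOT the values used in the study.
region_sizes = [10_000, 100_000]              # base pairs per simulated region
recombination_rates = [1e-9, 1e-8, 1e-7]      # per site per generation
population_sizes = [1_000, 10_000, 100_000]   # effective population size
# 11 substitution rates, matching the count stated in the methods overview;
# the spacing (doubling from 1e-9) is an assumption made for illustration.
substitution_rates = [1e-9 * 2 ** i for i in range(11)]


def build_grid(sizes, rec_rates, pop_sizes, sub_rates):
    """Return one dict per simulated alignment condition.

    Trees depend on (region size, recombination rate, population size);
    each tree condition is then combined with every substitution rate.
    """
    grid = []
    for size, rec, pop in product(sizes, rec_rates, pop_sizes):
        for mu in sub_rates:
            grid.append({
                "region_size": size,
                "recombination_rate": rec,
                "population_size": pop,
                "substitution_rate": mu,
            })
    return grid


grid = build_grid(region_sizes, recombination_rates,
                  population_sizes, substitution_rates)
tree_conditions = len(grid) // len(substitution_rates)
print(f"{tree_conditions} tree conditions, {len(grid)} alignment conditions")
```

With these placeholder lists the grid has 18 tree conditions and 198 alignment conditions; each entry would then be passed to the tree simulator and the sequence simulator in turn.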
Four programs were used to estimate gene trees from these simulated alignments: FastTree, Tsinfer, Relate, and Rent+. The estimated trees were then compared with the corresponding simulated trees using the Robinson-Foulds (RF) distance, computed with the R packages ape and phangorn. The resulting RF distances are shown as heat maps to visualize how each program performs in different areas of parameter space.
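For readers unfamiliar with the metric, the unrooted Robinson-Foulds distance counts the non-trivial bipartitions (splits) of the leaf set that appear in one tree but not the other. The sketch below is a pure-Python illustration of that definition, not the ape/phangorn code used in the study. The function names are introduced here for the example, and the parser assumes simple Newick strings (no quoted labels or comments) and two trees with identical leaf sets.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples of leaf names.

    Branch lengths and internal node labels are read and discarded.
    """
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:":
            pos += 1
        return text[start:pos]

    def skip_length():
        nonlocal pos
        if pos < len(text) and text[pos] == ":":
            pos += 1
            read_label()  # consume the branch length

    def read_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # consume "("
            children = [read_node()]
            while text[pos] == ",":
                pos += 1
                children.append(read_node())
            pos += 1  # consume ")"
            read_label()  # discard internal label or support value
            skip_length()
            return tuple(children)
        name = read_label()
        skip_length()
        return name

    return read_node()


def splits(tree):
    """Return the set of non-trivial bipartitions of an unrooted tree."""
    clades = []

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(collect(child) for child in node))
        clades.append(below)
        return below

    all_leaves = collect(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one side of each split
    result = set()
    for clade in clades:
        # Represent each split by the side that does not contain the anchor.
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result


def rf_distance(newick_a, newick_b):
    """Unrooted RF distance: splits found in exactly one of the two trees."""
    return len(splits(parse_newick(newick_a)) ^ splits(parse_newick(newick_b)))


simulated = "((A,B),(C,D),E);"
estimated = "((A,C),(B,D),E);"
print(rf_distance(simulated, estimated))
```

Because the splits are canonicalized against a fixed anchor leaf, the result does not depend on where either tree is rooted, so a rooted simulated tree and an unrooted estimated tree can be compared directly.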
Scripts to rerun the simulations are included along with the supplemental materials. Only the simulated gene trees and their PHYLIP alignments are included, because all input, output, and intermediate files together total around 1.5 terabytes. Upon request, we can provide specific data files, such as the output tree files from Tsinfer, Rent+, and Relate.
Created:
2022-11-18



