Data from: Demographic model selection using random forests and the site frequency spectrum
Source: DataCite Commons | Updated: 2025-06-01 | Indexed: 2025-06-15
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.2j27b
Description:
Phylogeographic data sets have grown from tens to thousands of loci in
recent years, but extant statistical methods do not take full advantage of
these large data sets. For example, approximate Bayesian computation (ABC)
is a commonly used method for the explicit comparison of alternate
demographic histories, but it is limited by the “curse of dimensionality”
and issues related to the simulation and summarization of data when
applied to next-generation sequencing (NGS) data sets. We implement here
several improvements to overcome these difficulties. We use a Random
Forest (RF) classifier for model selection to circumvent the curse of
dimensionality and apply a binned representation of the multidimensional
site frequency spectrum (mSFS) to address issues related to the simulation
and summarization of large SNP data sets. We evaluate the performance of
these improvements using simulation and find low overall error rates
(~7%). We then apply the approach to data from Haplotrema vancouverense, a
land snail endemic to the Pacific Northwest of North America. Fifteen
demographic models were compared, and our results support a model of
recent dispersal from coastal to inland rainforests. Our results
demonstrate that binning is an effective strategy for the construction of
an mSFS and imply that the statistical power of RF when applied to
demographic model selection is at least comparable to traditional ABC
algorithms. Importantly, by combining these strategies, large sets of
models with differing numbers of populations can be evaluated.
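
As a rough illustration of the binning idea described above, the following minimal sketch coarsens per-population allele counts into a small number of frequency classes and tallies SNPs per combination of classes. The function names, the equal-width binning rule, and the choice of four bins are assumptions made for this example and are not taken from the dataset or the authors' code.

```python
from collections import Counter
from itertools import product


def bin_index(count, sample_size, n_bins):
    """Map an allele count in 0..sample_size to a bin in 0..n_bins-1."""
    if not 0 <= count <= sample_size:
        raise ValueError("count outside 0..sample_size")
    # Assumed rule: equal-width bins over the frequency range;
    # a fixed allele (count == sample_size) falls in the last bin.
    return min(count * n_bins // sample_size, n_bins - 1)


def binned_msfs(snp_counts, sample_sizes, n_bins=4):
    """Return a binned multidimensional SFS as a flat list of SNP tallies.

    snp_counts: one tuple per SNP, giving the allele count in each population.
    sample_sizes: number of sampled chromosomes in each population.
    The result has n_bins ** len(sample_sizes) entries, one per combination
    of per-population bins, in itertools.product order.
    """
    tally = Counter(
        tuple(bin_index(c, n, n_bins) for c, n in zip(snp, sample_sizes))
        for snp in snp_counts
    )
    cells = product(range(n_bins), repeat=len(sample_sizes))
    return [tally[cell] for cell in cells]


# Toy data (hypothetical): 5 SNPs, 3 populations, 10 chromosomes each.
snps = [(0, 1, 9), (2, 2, 0), (10, 0, 5), (0, 1, 9), (3, 7, 10)]
features = binned_msfs(snps, sample_sizes=(10, 10, 10), n_bins=4)
```

With three populations and four bins the vector has 64 entries regardless of sample size, which is what keeps the summary compact as the number of sampled individuals grows. A vector like this, computed for each simulated data set, could serve as the feature row for a random forest classifier whose labels are the demographic models.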
Provider:
Dryad
Created:
2017-07-21



