Major inconsistencies of inferred population genetic structure estimated in a large set of domestic horse breeds using microsatellites

Mendeley Data · Updated 2024-06-25 · Indexed 2024-06-29

Download link:

https://zenodo.org/records/4003075

Description:

STRUCTURE remains the most widely applied tool for recovering the true, but unknown, population structure from observed microsatellite data or other genetic markers. About 30% of STRUCTURE-based studies could not be reproduced (Gilbert et al., 2012). Here we use a large data set of 2323 horses from 93 domestic breeds plus the Przewalski horse, typed at 15 microsatellite markers, to evaluate how program settings, in particular the so far insufficiently evaluated number of replicates, affect the estimation of the optimal number of population clusters Kopt that best describes the observed data. Domestic horses are well suited as a test case because the history of many breeds is well documented and extensive phylogenetic analyses are available. Different methods based on different genetic assumptions and statistical procedures (DAPC, FLOCK, PCoA and STRUCTURE with different run scenarios) all revealed the general, broad-scale relationships among the breeds, which largely reflect known breed histories, but diverged considerably in how they characterized small-scale patterns. STRUCTURE failed to consistently identify Kopt using the most widespread approach, the ΔK method, despite very large numbers of MCMC iterations (3,000,000) and replicates (100). The interpretation of breed structure over increasing values of K, without assuming a Kopt, was consistent with known breed histories. The over-reliance on Kopt should be replaced by a qualitative description of clustering over increasing K, which is scientifically more honest and has the advantage of being much faster and less computationally intensive, as lower numbers of MCMC iterations and replicates suffice for stable results. Very large data sets are highly challenging for cluster analyses, especially when populations with complex genetic histories are investigated.
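
For readers unfamiliar with the ΔK method mentioned above, the following is a minimal sketch of how ΔK is commonly computed from replicate STRUCTURE runs: the absolute second-order difference of the mean ln P(D) across K, divided by the standard deviation of ln P(D) at that K. This is one common reading of the statistic, not code from the data set; the function name `delta_k`, the input layout, and the toy log-likelihood values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Return {K: deltaK} for every K that has both neighbours K-1 and K+1.

    lnp_by_k maps each tested K to a list of ln P(D) values, one per
    replicate run (hypothetical input layout, assumed for this sketch).
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        # deltaK is undefined at the smallest and largest K tested.
        if k - 1 not in means or k + 1 not in means:
            continue
        replicates = lnp_by_k[k]
        if len(replicates) < 2:
            continue  # a standard deviation needs at least two replicates
        sd = stdev(replicates)
        if sd == 0:
            continue  # identical replicates: the ratio is undefined
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / sd
    return result


# Toy values, invented purely to exercise the function.
toy_runs = {
    1: [-100.0, -102.0],
    2: [-80.0, -82.0],
    3: [-78.0, -80.0],
    4: [-77.0, -79.0],
}
scores = delta_k(toy_runs)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because the denominator is the between-replicate standard deviation, the statistic depends directly on how many replicates are run and how much they vary, which is the setting the abstract singles out. The statistic also cannot be evaluated at the smallest or largest K tested.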

Created:

2023-06-28