Tournaments between markers as a strategy to enhance genomic predictions

Source: Figshare. Updated 2019-06-24; indexed 2026-04-29.

Download link:

https://figshare.com/articles/dataset/Tournaments_between_markers_as_a_strategy_to_enhance_genomic_predictions/8314964

Resource description:

Analysis of a large number of markers is crucial in both genome-wide association studies (GWAS) and genome-wide selection (GWS). However, two methodological issues restrict statistical analysis: high dimensionality (p≫n) and multicollinearity. Although there are methodologies for fitting models to high-dimensional data (e.g., the Bayesian Lasso), a major problem in these cases is that the model may predict well for the individuals used to fit it but poorly for other individuals, which restricts its applicability. This problem can be circumvented by applying a selection methodology that reduces the number of markers, while keeping those associated with the phenotypic trait, before fitting a model to predict genomic breeding values (GBVs). We revisit a tournament-based strategy between marker samples, where each sample has good statistical properties for estimation: n>p and low collinearity. The tournaments use multiple linear regression to eliminate markers. The method is adapted from previous work in the literature. We used simulated data as well as real data from a study of SNPs in beef cattle. Tournament strategies not only circumvent the p≫n issue but also minimize spurious associations. For the real data, selecting slightly more than 20 markers gave correlations greater than 0.70 between predicted GBVs and phenotypes in the validation groups of a cross-validation scheme; selecting a larger number of markers (more than 100) gave correlations exceeding 0.90, showing the method's efficiency in identifying relevant SNPs (or segregations) for both GWAS and GWS. The simulation study gave similar results.
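The description above gives the idea of the tournament but not its exact rules, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes markers are split at random into groups smaller than the number of individuals (so every regression has n>p), that each group is fitted by ordinary least squares, and that the markers with the largest absolute t-statistics advance to the next round. The function names and the `group_size`, `winners`, and `n_keep` parameters are hypothetical, and random grouping does not by itself enforce the low-collinearity condition mentioned above.

```python
import numpy as np


def ols_abs_t(X, y):
    """Absolute t-statistics of the slopes from an OLS fit of y on X plus an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - p - 1)
    se = np.sqrt(np.diag(sigma2 * np.linalg.pinv(A.T @ A)))
    return np.abs(beta[1:] / np.maximum(se[1:], 1e-12))


def tournament_select(X, y, n_keep, group_size=20, winners=5, seed=0):
    """Return the sorted column indices of at most n_keep surviving markers.

    Assumed rules (not taken from the source): random groups of size
    group_size, OLS within each group, top `winners` by |t| advance.
    """
    n = X.shape[0]
    if group_size >= n - 1:
        raise ValueError("group_size must be below n - 1 so that n > p in each fit")
    if not 0 < winners < group_size:
        raise ValueError("winners must be between 1 and group_size - 1")
    rng = np.random.default_rng(seed)
    alive = np.arange(X.shape[1])
    while alive.size > n_keep:
        if alive.size <= group_size:
            # Final round: one regression on all remaining markers.
            t = ols_abs_t(X[:, alive], y)
            alive = alive[np.argsort(t)[::-1][:n_keep]]
            break
        rng.shuffle(alive)  # new random groups every round
        survivors = []
        for start in range(0, alive.size, group_size):
            group = alive[start:start + group_size]
            t = ols_abs_t(X[:, group], y)
            k = min(winners, group.size)
            survivors.extend(group[np.argsort(t)[::-1][:k]])
        alive = np.array(survivors)
    return np.sort(alive)


def holdout_correlation(X, y, markers, train, test):
    """Correlation between predictions and phenotypes on held-out individuals."""
    A_train = np.column_stack([np.ones(train.size), X[np.ix_(train, markers)]])
    beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
    A_test = np.column_stack([np.ones(test.size), X[np.ix_(test, markers)]])
    return np.corrcoef(A_test @ beta, y[test])[0, 1]
```

To mirror the cross-validation scheme mentioned above, markers should be selected using the training individuals only, for example `tournament_select(X[train], y[train], n_keep=10)`, and the result passed to `holdout_correlation` together with the `train` and `test` index arrays. Selecting markers on the full data before splitting would leak information from the validation group into the model.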

Created:

2019-06-24