
Table_2_Training Set Construction for Genomic Prediction in Auto-Tetraploids: An Example in Potato.DOCX

Indexed by NIAID Data Ecosystem on 2026-03-13
Download link:
https://figshare.com/articles/dataset/Table_2_Training_Set_Construction_for_Genomic_Prediction_in_Auto-Tetraploids_An_Example_in_Potato_DOCX/17073599
Description:
Training set construction is an important prerequisite to Genomic Prediction (GP). It has been studied in diploids, but polyploids have not received the same attention. Polyploidy is common in crop plants such as banana and blueberry, and also in potato, which is the third most important crop in the world in terms of food consumption, after rice and wheat. The aim of this study was to investigate the impact of different training set construction methods using a publicly available diversity panel of tetraploid potatoes. Four methods of training set construction were compared: simple random sampling, stratified random sampling, genetic distance sampling, and sampling based on the coefficient of determination (CDmean). For stratified random sampling, population structure analyses were carried out to define sub-populations, but because sub-populations accounted for only 16.6% of genetic variation, the differences between stratified and simple random sampling were negligible. For genetic distance sampling, four genetic distance measures were compared; they performed similarly, but Euclidean distance was the most consistent. In the majority of cases the CDmean method was the best sampling method: compared to simple random sampling, it gave improvements of 4–14% in cross-validation scenarios and 2–8% in scenarios with an independent test set, while genetic distance sampling gave improvements of 5.5–10.5% and 0.4–4.5%, respectively. No interaction was found between sampling method and the statistical model for the traits analyzed.
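To make the difference between the first two sampling methods concrete, here is a minimal Python sketch of simple random sampling and stratified random sampling with proportional allocation. It is an illustration under stated assumptions, not the study's implementation: the function names, the `groups` mapping from genotype ID to sub-population label, and the largest-remainder rounding rule are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def simple_random_sample(ids, n, seed=0):
    """Draw n genotype IDs uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(ids), n)


def stratified_random_sample(groups, n, seed=0):
    """Draw n genotype IDs, allocating the sample to sub-populations
    in proportion to their size.

    groups: dict mapping genotype ID -> sub-population label
            (hypothetical input format, assumed for this sketch).
    """
    rng = random.Random(seed)

    # Collect the members of each sub-population.
    by_group = defaultdict(list)
    for genotype_id, label in groups.items():
        by_group[label].append(genotype_id)

    # Proportional allocation; fractional quotas are resolved with the
    # largest-remainder rule so the allocations sum to exactly n.
    total = len(groups)
    quotas = {g: n * len(members) / total for g, members in by_group.items()}
    alloc = {g: int(q) for g, q in quotas.items()}
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1

    # Sample within each sub-population, in a fixed label order.
    chosen = []
    for g in sorted(by_group):
        chosen.extend(rng.sample(by_group[g], alloc[g]))
    return chosen


if __name__ == "__main__":
    # Toy panel: 100 genotypes in three sub-populations of size 60, 30 and 10.
    toy = {f"g{i}": ("A" if i < 60 else "B" if i < 90 else "C") for i in range(100)}
    print(len(simple_random_sample(toy, 20)))
    print(len(stratified_random_sample(toy, 20)))
```

When sub-populations explain little of the genetic variation, as with the 16.6% reported above, the two functions draw from nearly the same distribution of genotypes, which is consistent with the negligible difference described in the abstract.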

Created:
2021-11-24