Data from: Applications of random forest feature selection for fine-scale genetic population assignment
Source: DataONE. Updated: 2017-07-27. Indexed: 2024-06-26.
Download link:
https://search.dataone.org/view/null
Description:
Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine-learning algorithms (random forest, regularized random forest, and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNPs) for fine-scale population assignment. We applied these methods to an unpublished SNP dataset for Atlantic salmon (Salmo salar) and a published SNP dataset for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self-assignment accuracy of at least 90% using each method to create panels of 50-700 markers. Panels of SNPs identified using random forest-based methods performed up to 7.8 and 11.2 percentage points better than FST-selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self-assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each dataset, respectively, a level of accuracy never reached for these species using FST-selected panels. Our results demonstrate a role for machine-learning approaches in marker selection across large genomic datasets to improve assignment for management and conservation of exploited populations.
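As a rough illustration of the FST-ranking baseline mentioned in the description, the sketch below scores each SNP with a simple Nei-style ratio, (HT - HS) / HT, and keeps the highest-scoring markers as a panel. The record does not say which FST estimator or software the authors used, so the estimator, the function names (`fst_per_snp`, `top_snps`), the 0/1/2 genotype coding, and the toy data are all assumptions made for this example. The random forest-based selection methods are not shown.

```python
from collections import defaultdict


def fst_per_snp(genotypes, populations):
    """Return one FST-like score per SNP (hypothetical helper).

    genotypes: list of individuals, each a list of alternate-allele
               counts (0, 1 or 2) with one entry per SNP.
    populations: list of population labels, one per individual.
    """
    n_snps = len(genotypes[0])

    # Group genotype rows by population label.
    by_pop = defaultdict(list)
    for row, pop in zip(genotypes, populations):
        by_pop[pop].append(row)

    scores = []
    for j in range(n_snps):
        # Alternate-allele frequency per population (diploid: 2 alleles each).
        freqs = [
            sum(row[j] for row in rows) / (2 * len(rows))
            for rows in by_pop.values()
        ]
        p_bar = sum(freqs) / len(freqs)  # unweighted mean frequency
        h_t = 2 * p_bar * (1 - p_bar)    # total expected heterozygosity
        h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # within-pop mean
        # Monomorphic SNPs (h_t == 0) carry no signal; score them 0.
        scores.append((h_t - h_s) / h_t if h_t > 0 else 0.0)
    return scores


def top_snps(scores, panel_size):
    """Indices of the panel_size highest-scoring SNPs, best first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:panel_size]


# Toy data (invented): 4 individuals, 3 SNPs, 2 populations.
genotypes = [
    [0, 1, 0],
    [0, 1, 1],
    [2, 1, 0],
    [2, 1, 1],
]
populations = ["A", "A", "B", "B"]

scores = fst_per_snp(genotypes, populations)
panel = top_snps(scores, 1)
```

In this toy input only the first SNP differs in allele frequency between the two populations, so it is the one selected for a panel of size 1. A real comparison like the one described would then measure self-assignment accuracy for panels of increasing size.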
Created:
2017-07-27



