Data from: Applications of random forest feature selection for fine-scale genetic population assignment

Source: DataONE | Updated: 2017-07-27 | Indexed: 2024-06-26

Download link:

https://search.dataone.org/view/null

Resource description:

Genetic population assignment used to inform wildlife management and conservation efforts requires panels of highly informative genetic markers and sensitive assignment tests. We explored the utility of machine-learning algorithms (random forest, regularized random forest, and guided regularized random forest) compared with FST ranking for selection of single nucleotide polymorphisms (SNPs) for fine-scale population assignment. We applied these methods to an unpublished SNP dataset for Atlantic salmon (Salmo salar) and a published SNP dataset for Alaskan Chinook salmon (Oncorhynchus tshawytscha). In each species, we identified the minimum panel size required to obtain a self-assignment accuracy of at least 90% using each method to create panels of 50-700 markers. Panels of SNPs identified using random forest-based methods performed up to 7.8 and 11.2 percentage points better than FST-selected panels of similar size for the Atlantic salmon and Chinook salmon data, respectively. Self-assignment accuracy ≥90% was obtained with panels of 670 and 384 SNPs for each dataset, respectively, a level of accuracy never reached for these species using FST-selected panels. Our results demonstrate a role for machine-learning approaches in marker selection across large genomic datasets to improve assignment for management and conservation of exploited populations.
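
To make the FST-ranking baseline mentioned in the description concrete, here is a minimal sketch in Python. It is not the authors' code. It assumes diploid genotypes coded 0/1/2 (copies of the alternate allele), uses a simple heterozygosity-based FST estimate with equal population weights (which may differ from the estimator used in the study), and the function names and toy genotypes are invented for illustration. The random forest methods would replace the per-SNP score with an importance measure from a fitted model; that step is not shown.

```python
from typing import List, Sequence

# Hypothetical helper names; toy data only. Genotypes are coded 0/1/2.
Population = Sequence[Sequence[int]]  # individuals x SNPs


def allele_freq(genotypes: Sequence[int]) -> float:
    """Alternate-allele frequency from diploid genotypes coded 0/1/2."""
    return sum(genotypes) / (2 * len(genotypes))


def fst_per_snp(pops: Sequence[Population]) -> List[float]:
    """Per-SNP FST as (H_T - H_S) / H_T, weighting populations equally."""
    n_snps = len(pops[0][0])
    scores = []
    for j in range(n_snps):
        freqs = [allele_freq([ind[j] for ind in pop]) for pop in pops]
        p_bar = sum(freqs) / len(freqs)
        h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
        h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # within pops
        # A SNP that is monomorphic everywhere carries no information.
        scores.append(0.0 if h_t == 0 else (h_t - h_s) / h_t)
    return scores


def top_k_snps(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring SNPs, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Two made-up populations, three individuals each, three SNPs.
pop_a = [[0, 1, 2], [0, 1, 2], [0, 0, 2]]
pop_b = [[2, 1, 2], [2, 1, 2], [2, 2, 2]]

scores = fst_per_snp([pop_a, pop_b])
panel = top_k_snps(scores, 2)
print(panel)  # [0, 1]
```

In this toy case SNP 0 is fixed for different alleles in the two populations (FST = 1), SNP 1 differs only modestly, and SNP 2 is monomorphic, so a two-marker panel keeps SNPs 0 and 1.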


Created:

2017-07-27
