Data from: Accounting for genotype uncertainty in the estimation of allele frequencies in autopolyploids

Source: DataONE. Updated: 2015-11-20. Indexed: 2024-06-27.
Download link:
https://search.dataone.org/view/null
Description:
Despite the increasing opportunity to collect large-scale data sets for population genomic analyses, high-throughput sequencing has seen little application in studies of polyploid populations. This is due in large part to the difficulty of determining allele copy number in the genotypes of polyploid individuals (allelic dosage uncertainty, ADU), which complicates the calculation of important quantities such as allele frequencies. Here, we describe a statistical model to estimate biallelic SNP frequencies in a population of autopolyploids using high-throughput sequencing data in the form of read counts. We bridge the gap from data collection (using restriction-enzyme-based techniques [e.g. GBS, RADseq]) to allele frequency estimation in a unified inferential framework, using a hierarchical Bayesian model to sum over genotype uncertainty. Simulated data sets were generated under various conditions for tetraploid, hexaploid and octoploid populations to evaluate the model's performance and to help guide the collection of empirical data. We also provide an implementation of our model in the R package polyfreqs and demonstrate its use with two example analyses that investigate (i) levels of expected and observed heterozygosity and (ii) model adequacy. Our simulations show that the number of individuals sampled from a population has a greater impact on estimation error than sequencing coverage. The example analyses also show that our model and software can be used to make inferences beyond the estimation of allele frequencies for autopolyploids by providing assessments of model adequacy and estimates of heterozygosity.
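
To make the idea of "summing over genotype uncertainty" concrete, the following is a minimal Python sketch and not the polyfreqs implementation. It assumes (these details are not stated in the description above) that each individual's genotype is the number of reference-allele copies drawn as Binomial(ploidy, p), that reference read counts are Binomial(total reads, expected read fraction), and that a fixed per-read error rate `eps` applies. In place of the hierarchical Bayesian inference the description mentions, it uses a plain grid search for the maximum-likelihood frequency. All function and parameter names are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def marginal_loglik(p, ref_reads, total_reads, ploidy, eps=0.01):
    """Log-likelihood of allele frequency p at one locus.

    For each individual, the unknown genotype g (0..ploidy reference
    copies) is summed out, so no single genotype call is required.
    The error model and eps value are assumptions for illustration.
    """
    loglik = 0.0
    for r, t in zip(ref_reads, total_reads):
        site = 0.0
        for g in range(ploidy + 1):
            frac = g / ploidy
            # Expected fraction of reference reads, allowing for read errors.
            p_read = frac * (1 - eps) + (1 - frac) * eps
            site += binom_pmf(g, ploidy, p) * binom_pmf(r, t, p_read)
        loglik += math.log(site)
    return loglik


def grid_estimate(ref_reads, total_reads, ploidy, eps=0.01, steps=999):
    """Grid-search maximum-likelihood estimate of p on the open interval (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(
        grid,
        key=lambda p: marginal_loglik(p, ref_reads, total_reads, ploidy, eps),
    )


if __name__ == "__main__":
    # Four hypothetical tetraploid individuals, 10 reads each at one SNP.
    ref = [10, 5, 0, 5]
    tot = [10, 10, 10, 10]
    print(grid_estimate(ref, tot, ploidy=4))
```

Because the genotype is marginalized rather than called, low-coverage individuals still contribute information, only with a flatter likelihood. A Bayesian treatment would place a prior on `p` and sample from the posterior instead of maximizing over a grid.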

Created:
2015-11-20