
Data from: Estimating genotyping errors from genotype and reconstructed pedigree data

Source: DataONE. Updated: 2017-08-07. Indexed: 2024-06-26.
Download link:
https://search.dataone.org/view/null
Description:
1. Genotyping errors are the rule rather than the exception, and are found in virtually all but very small datasets. Even at an extremely low rate, these errors can derail many genetic analyses, such as parentage/sibship assignments and linkage/association studies.
2. Nonetheless, few robust and accurate methods are available for estimating the rate of genotyping errors and for identifying individual erroneous genotypes at a locus. Methods based on duplicate genotyping are expensive, and estimate genotype inconsistency rather than the error rate at a locus. Methods based on Hardy-Weinberg equilibrium tests have low robustness and low power, and apply only to those particular errors that cause excessive homozygosity. Methods based on pedigrees are powerful, robust and accurate. However, they rely on known and complete pedigrees, which are rarely available for natural populations in the wild.
3. I proposed a maximum likelihood method to reconstruct pedigrees from genotype data with errors occurring at a roughly estimated (presumed) rate. In this paper, I describe how to use the method and the inferred pedigree to jointly estimate the allelic dropout (or null allele) rate and the false allele rate at each marker locus, to identify the erroneous genotypes, and to infer the most likely genotype of each individual at each locus. I examine the power, accuracy and robustness of the method by extensive simulations, and demonstrate its usefulness by analysing three empirical datasets.
4. I conclude that both pedigrees and the rates of genotyping errors at each locus can be reliably estimated from the same genotype data by the same likelihood method, provided that marker information is sufficient and some sampled individuals are first-degree relatives. Erroneous genotypes, however, are inferred conservatively, and are reliably detected only when they occur in large families and/or at highly polymorphic loci. Estimation of genotyping error rates per locus and identification of the erroneous genotypes of each individual at each locus should be conducted routinely to assess and improve data quality, to highlight markers for optimization of genotyping protocols or for replacement, and to enable the integration of genotyping errors into a robust statistical analysis.
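
The description notes that Hardy-Weinberg tests only pick up errors that cause excessive homozygosity. The toy simulation below sketches why allelic dropout is such an error. It is not the likelihood method of the paper: the dropout model (a heterozygote loses one randomly chosen allele with a fixed probability), the function names and the parameter values are all illustrative assumptions.

```python
import random


def apply_dropout(genotype, dropout_rate, rng):
    """Return the observed genotype under a toy allelic-dropout model.

    Assumption (not taken from the paper): a heterozygote loses one allele,
    chosen at random, with probability `dropout_rate`, and is then scored
    as a homozygote for the remaining allele. Homozygotes are unaffected.
    """
    a, b = genotype
    if a != b and rng.random() < dropout_rate:
        kept = rng.choice((a, b))
        return (kept, kept)
    return genotype


def heterozygosity(genotypes):
    """Fraction of genotypes whose two alleles differ."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def simulate(n, allele_freqs, dropout_rate, seed=1):
    """Draw n random-mating genotypes at one locus and apply dropout.

    Returns (true heterozygosity, observed heterozygosity).
    """
    rng = random.Random(seed)
    alleles = list(range(len(allele_freqs)))
    true = [tuple(rng.choices(alleles, weights=allele_freqs, k=2))
            for _ in range(n)]
    observed = [apply_dropout(g, dropout_rate, rng) for g in true]
    return heterozygosity(true), heterozygosity(observed)


if __name__ == "__main__":
    h_true, h_obs = simulate(20000, [0.5, 0.5], dropout_rate=0.2)
    print(f"true heterozygosity:     {h_true:.3f}")
    print(f"observed heterozygosity: {h_obs:.3f}")
```

Under this assumed model the observed heterozygosity shrinks by roughly the dropout rate, so the locus shows a homozygote excess. An error that swaps one allele for another without favouring homozygotes would leave no such signal, which is the limitation the description attributes to Hardy-Weinberg-based methods.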

Created:
2017-08-07