Table_1_A maximum-likelihood method to estimate haplotype frequencies and prevalence alongside multiplicity of infection from SNP data.XLSX
Source: NIAID Data Ecosystem (indexed 2026-03-14)
Download link:
https://figshare.com/articles/dataset/Table_1_A_maximum-likelihood_method_to_estimate_haplotype_frequencies_and_prevalence_alongside_multiplicity_of_infection_from_SNP_data_XLSX/21193903
Description:
The introduction of genomic methods facilitated standardized molecular disease surveillance. For instance, SNP barcodes in Plasmodium vivax and Plasmodium falciparum malaria allow the characterization of haplotypes, their frequencies, and their prevalence to reveal temporal and spatial transmission patterns. A confounding factor is the presence of multiple genetically distinct pathogen variants within the same infection, known as multiplicity of infection (MOI). Disregarding ambiguous information, as is usually done in ad-hoc approaches, leads to less confident and biased estimates. We introduce a statistical framework to obtain maximum-likelihood estimates (MLE) of haplotype frequencies and prevalence alongside MOI from malaria SNP data, i.e., multiple biallelic marker loci. The number of model parameters increases geometrically with the number of genetic markers considered, and no closed-form solution exists for the MLE. Therefore, the MLE needs to be derived numerically. We use the Expectation-Maximization (EM) algorithm, an efficient and easy-to-implement algorithm that yields a numerically stable solution, to derive the maximum-likelihood estimates. We also derive expressions for haplotype prevalence based on either all or just the unambiguous genetic information and compare both approaches. The latter corresponds to a biased ad-hoc estimate of prevalence. We assess the performance of our estimator by systematic numerical simulations assuming realistic sample sizes and various scenarios of transmission intensity. For reasonable sample sizes and numbers of loci, the method has little bias. As an example, we apply the method to a dataset from Cameroon on sulfadoxine-pyrimethamine resistance in P. falciparum malaria. The method is not confined to malaria and can be applied to any infectious disease with similar transmission behavior. An easy-to-use implementation of the method as an R script is provided.
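Two points in the description can be made concrete with a small illustration: the geometric growth of the parameter space with the number of biallelic loci, and the ambiguity that arises when an infection carries more than one haplotype. The sketch below is not the authors' R script. All function names are hypothetical, alleles are coded 0/1 for convenience, and the single MOI parameter is an assumption, since the description does not state how MOI is modeled.

```python
from itertools import product


def haplotypes(n_loci):
    """All 2**n_loci haplotypes over biallelic loci, alleles coded 0/1."""
    return list(product((0, 1), repeat=n_loci))


def n_free_parameters(n_loci):
    """Free haplotype frequencies (they sum to 1) plus one MOI parameter.

    The single MOI parameter is an assumption made for this sketch.
    """
    return (2 ** n_loci - 1) + 1


def observe(infection):
    """Per-locus set of alleles detected in an infection.

    `infection` is a non-empty list of haplotypes. The observation records
    which alleles are present at each locus, not which haplotypes carry them.
    """
    n_loci = len(infection[0])
    return tuple(frozenset(h[k] for h in infection) for k in range(n_loci))


def compatible_pairs(observation):
    """Unordered haplotype pairs (MOI = 2 only) that yield this observation."""
    haps = haplotypes(len(observation))
    return [
        (a, b)
        for i, a in enumerate(haps)
        for b in haps[i:]
        if observe([a, b]) == observation
    ]


if __name__ == "__main__":
    # The number of haplotypes doubles with every additional locus.
    for n in (2, 3, 5, 10):
        print(n, len(haplotypes(n)), n_free_parameters(n))

    # Both alleles detected at both loci: the haplotype phase is ambiguous.
    obs = observe([(0, 0), (1, 1)])
    print(compatible_pairs(obs))
```

In the two-locus case, an infection showing both alleles at both loci could consist of haplotypes 00 and 11 or of 01 and 10. Observations of this kind are the ambiguous information that ad-hoc approaches discard and that a likelihood-based method can retain.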
Created:
2022-09-23



