Fast Search and Estimation of Bayesian Nonparametric Mixture Models Using a Classification Annealing EM Algorithm
Source: DataCite Commons. Updated: 2025-04-01. Indexed: 2024-07-28.
Download link:
https://tandf.figshare.com/articles/dataset/Fast_Search_and_Estimation_of_Bayesian_Nonparametric_Mixture_Models_Using_a_Classification_Annealing_EM_Algorithm/12844493/2
Description:
Bayesian nonparametric (BNP) infinite-mixture models provide flexible and accurate density estimation, cluster analysis, and regression. However, for the posterior inference of such a model, MCMC algorithms are complex, often need to be tailor-made for different BNP priors, and are intractable for large datasets. We introduce a BNP classification annealing EM algorithm which employs importance sampling estimation. This new fast-search algorithm, for virtually any given BNP mixture model, can quickly and accurately calculate the posterior predictive density estimate (by posterior averaging) and the maximum a-posteriori clustering estimate (by simulated annealing), even for datasets containing millions of observations. The algorithm can handle a wide range of BNP priors because it primarily relies on the ability to generate prior samples. The algorithm can be fast because in each iteration, it performs a sampling step for the (missing) clustering of the data points, instead of a costly E-step; and then performs direct posterior calculations in the M-step, given the sampled (imputed) clustering. The new algorithm is illustrated and evaluated through BNP Gaussian mixture model analyses of benchmark simulated data and real datasets. MATLAB code for the new algorithm is provided in the supplementary materials. Supplementary materials for this article are available online.
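The iteration described above (sample a hard clustering instead of running an E-step, then update parameters in closed form given that clustering) can be illustrated with a small toy sketch. This is not the authors' algorithm or their MATLAB code: it leaves out the BNP prior and the importance sampling estimation entirely, and it assumes a one-dimensional Gaussian mixture with a fixed truncation level `k`, a geometric cooling schedule, and simple regularised updates. The function name `annealed_classification_em` and all of its parameters are hypothetical.

```python
import math
import random


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def annealed_classification_em(data, k=5, n_iter=60, t_start=2.0, t_end=0.1, seed=0):
    """Toy annealed classification EM for a 1-D Gaussian mixture.

    Assumptions (not taken from the article): a fixed number of
    components k, a geometric temperature schedule, and regularised
    closed-form updates. `data` must be a list with at least k values.
    """
    rng = random.Random(seed)
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n + 1e-6
    means = rng.sample(data, k)      # initialise means at random data points
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    labels = [0] * n

    for it in range(n_iter):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (it / max(n_iter - 1, 1))

        # Sampling step: draw each label from its tempered conditional
        # distribution instead of computing soft responsibilities.
        for i, x in enumerate(data):
            logp = [
                (math.log(weights[j]) + normal_logpdf(x, means[j], variances[j])) / temp
                for j in range(k)
            ]
            top = max(logp)
            probs = [math.exp(v - top) for v in logp]  # largest entry equals 1
            labels[i] = rng.choices(range(k), weights=probs)[0]

        # Update step: closed-form estimates given the imputed labels.
        for j in range(k):
            members = [x for x, z in zip(data, labels) if z == j]
            m = len(members)
            weights[j] = (m + 1.0) / (n + k)  # add-one smoothing keeps weights > 0
            if m > 0:
                means[j] = sum(members) / m
                ss = sum((x - means[j]) ** 2 for x in members)
                # Shrink toward the overall variance so no component collapses.
                variances[j] = (ss + grand_var) / (m + 1.0)
            # An empty component keeps its previous mean and variance.

    return labels, weights, means, variances
```

Calling `annealed_classification_em(data, k=4)` on a list of floats returns the final sampled labels together with the component weights, means, and variances. As the temperature falls, the label draws become nearly deterministic, which loosely mirrors the role simulated annealing plays in the search for a maximum a-posteriori clustering.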
Provider:
Taylor & Francis
Created:
2021-09-29



