Clustering multivariate longitudinal observations: The contaminated Gaussian hidden Markov model
Source: DataCite Commons. Updated: 2026-02-16. Indexed: 2024-07-25.
Download link:
https://tandf.figshare.com/articles/dataset/Clustering_multivariate_longitudinal_observations_The_contaminated_Gaussian_hidden_Markov_model/1568562/2
Description:
The Gaussian hidden Markov model (HMM) is widely considered for the analysis of heterogeneous continuous multivariate longitudinal data. To robustify this approach with respect to possible elliptical heavy-tailed departures from normality, due to the presence of outliers, spurious points, or noise (collectively referred to as "bad points" herein), the contaminated Gaussian HMM is here introduced. The contaminated Gaussian distribution represents an elliptical generalization of the Gaussian distribution and allows for automatic detection of bad points in the same natural way as observations are typically assigned to the latent states in the HMM context. Once the model is fitted, each observation has a posterior probability of belonging to a particular state and, inside each state, of being a bad point or not. In addition to the parameters of the classical Gaussian HMM, for each state we have two more parameters, both with a specific and useful interpretation: one controls the proportion of bad points and one specifies their degree of atypicality. A sufficient condition for the identifiability of the model is given, an expectation-conditional maximization algorithm is outlined for parameter estimation and various operational issues are discussed. Using a large-scale simulation study, but also an illustrative artificial dataset, we demonstrate the effectiveness of the proposed model in comparison with HMMs of different elliptical distributions, and we also evaluate the performance of some well-known information criteria in selecting the true number of latent states. The model is finally used to fit data on criminal activities in Italian provinces.
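To make the roles of the two extra per-state parameters concrete, the following is a minimal sketch of a single state's contaminated Gaussian density and the resulting probability that an observation is a bad point. It assumes the common parameterization in which `alpha` is the proportion of good points (so `1 - alpha` is the proportion of bad points) and `eta > 1` inflates the covariance for bad points; the function names and the example values are illustrative and do not come from the dataset or the paper's code. In the full HMM these within-state quantities would be combined with the state posteriors, which this sketch does not compute.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at a single point x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.shape[0]
    diff = x - mean
    # Squared Mahalanobis distance, computed without forming the inverse.
    maha = diff @ np.linalg.solve(cov, diff)
    log_det = np.linalg.slogdet(cov)[1]
    return float(np.exp(-0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)))


def contaminated_gaussian_pdf(x, mean, cov, alpha, eta):
    """Assumed form: alpha * N(mean, cov) + (1 - alpha) * N(mean, eta * cov)."""
    cov = np.asarray(cov, dtype=float)
    good = alpha * gaussian_pdf(x, mean, cov)
    bad = (1.0 - alpha) * gaussian_pdf(x, mean, eta * cov)
    return good + bad


def prob_bad_point(x, mean, cov, alpha, eta):
    """Probability that x is a bad point, given that it belongs to this state."""
    cov = np.asarray(cov, dtype=float)
    bad = (1.0 - alpha) * gaussian_pdf(x, mean, eta * cov)
    return bad / contaminated_gaussian_pdf(x, mean, cov, alpha, eta)


# Hypothetical state: 2-dimensional, 90% good points, inflation factor 10.
mean = np.zeros(2)
cov = np.eye(2)
p_center = prob_bad_point([0.0, 0.0], mean, cov, alpha=0.9, eta=10.0)
p_far = prob_bad_point([6.0, 6.0], mean, cov, alpha=0.9, eta=10.0)
print(p_center, p_far)
```

Under these assumed values, a point at the state mean receives a small bad-point probability, while a point far from the mean is flagged as bad almost surely, which mirrors the automatic detection described in the abstract.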
Provider:
Taylor & Francis
Created:
2016-02-18



