
Cellwise Outlier Detection in Heterogeneous Populations

Taylor & Francis Group | Updated 2025-06-30 | Indexed 2026-04-16
Download link:
https://tandf.figshare.com/articles/dataset/Cellwise_outlier_detection_in_heterogeneous_populations/28931076/2
Description:
Real-world applications may be affected by outlying values. In the model-based clustering literature, several methodologies have been proposed to detect units that deviate from the majority of the data (rowwise outliers) and trim them from the parameter estimates. However, the discarded observations can carry valuable information in some of their observed features. Following the more recent cellwise contamination paradigm, we introduce a Gaussian mixture model for cellwise outlier detection. The proposal is estimated via an Expectation-Maximization (EM) algorithm with an additional step that flags the contaminated cells of a data matrix and then imputes them, rather than discarding them, before the parameter estimation. This procedure adheres to the spirit of the EM algorithm by treating the contaminated cells as missing values. We analyze the performance of the proposed model in comparison with other existing methodologies through a simulation study with different scenarios and illustrate its potential use for clustering, outlier detection, and imputation on three real datasets. Additional applications include socio-economic studies, environmental analysis, healthcare, and any domain where the aim is to cluster data affected by missing information and outlying values within features.
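
The idea of treating flagged cells as missing and imputing them can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes a single Gaussian component whose mean and covariance are already known, uses a placeholder flagging rule (marginal z-score above a threshold), and the names `flag_cells` and `impute_flagged_cells` are hypothetical. Only the imputation step, the Gaussian conditional mean of the flagged cells given the unflagged ones, is a standard result.

```python
import numpy as np

def flag_cells(x, mu, sigma, threshold=3.0):
    """Placeholder rule (an assumption, not the paper's): flag a cell when
    its marginal z-score exceeds `threshold`."""
    z = np.abs(x - mu) / np.sqrt(np.diag(sigma))
    return z > threshold

def impute_flagged_cells(x, flagged, mu, sigma):
    """Replace the flagged cells of one row with their conditional mean
    given the unflagged cells, under N(mu, sigma)."""
    x = np.asarray(x, dtype=float).copy()
    m = np.asarray(flagged, dtype=bool)   # cells treated as missing
    o = ~m                                # cells treated as observed
    if not m.any():
        return x
    if not o.any():
        x[:] = mu                         # nothing reliable: fall back to the mean
        return x
    sigma_oo = sigma[np.ix_(o, o)]
    sigma_mo = sigma[np.ix_(m, o)]
    # E[x_m | x_o] = mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o)
    x[m] = mu[m] + sigma_mo @ np.linalg.solve(sigma_oo, x[o] - mu[o])
    return x

# Toy example: the second cell of the row is far from the component mean.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
row = np.array([1.0, 10.0])
flags = flag_cells(row, mu, sigma)
cleaned = impute_flagged_cells(row, flags, mu, sigma)
```

In this toy case only the second cell is flagged, and it is replaced using the first cell and the correlation between the two features, so the row is kept rather than trimmed.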

Provided by:
García-Escudero, Luis A.; Mayo-Íscar, Agustín; Zaccaria, Giorgia; Greselin, Francesca
Created:
2025-06-30