Model-based Clustering and Prediction with Mixed Measurements involving Surrogate Classifiers

DataCite Commons. Updated 2022-08-02; indexed 2024-08-17
Download link:
https://tandf.figshare.com/articles/dataset/Model-based_Clustering_and_Prediction_with_Mixed_Measurements_involving_Surrogate_Classifiers/13373246/1
Resource description:
Identification of underlying subpopulations to account for unobserved heterogeneity in the population is a challenging statistical problem, mainly because no explicit information about the latent classes is available. Although latent class analysis via finite mixture models is often used successfully to probabilistically identify subpopulations in applications, it often fails with data for which such subpopulations exhibit high latency. Borrowing strength from readily accessible auxiliary classifiers, even when subject to misclassification, may yield improved results in such settings. We develop in this paper a joint modelling approach that combines data from multiple sources, including observed characteristics that are often used alone for clustering and classification, as well as results based on imperfect surrogate classifiers, in order to better identify the latent classes for more accurate classification and prediction. We outline maximum likelihood estimation for the joint model using the EM algorithm, and we show empirically via simulations that our methodology yields better estimates of the underlying latent class distributions than those obtained by ignoring the auxiliary information, while providing joint assessments of the surrogate classifiers. The advantages are significant when there is high latency and the surrogate classifiers are at least moderately accurate. We use real diagnostic data on dry eye disease, for which no gold standard is available, to illustrate our methodology.
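
To make the idea of borrowing strength from a surrogate classifier concrete, the following is a minimal sketch of EM for a deliberately simplified joint model. It is not the authors' model or code. It assumes two latent classes, a single continuous measurement `x` that is Gaussian within each class, one binary surrogate classifier `s`, and conditional independence of `x` and `s` given the latent class. The function names `simulate` and `em_joint`, the starting values, and the simulation settings are all hypothetical choices made for illustration.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def simulate(n, seed=0):
    """Draw toy data: latent class z, measurement x, imperfect surrogate s.

    All numeric settings here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, 0.4, size=n)                       # P(z = 1) = 0.4
    x = rng.normal(loc=np.where(z == 1, 2.0, 0.0), scale=1.0)
    s = rng.binomial(1, np.where(z == 1, 0.8, 0.2))        # sensitivity 0.8, specificity 0.8
    return x, s, z


def em_joint(x, s, n_iter=200):
    """EM for the simplified joint model (an assumption, not the paper's model).

    Returns the class-1 proportion, class means, class standard deviations,
    q[k] = P(s = 1 | class k), and the posterior probability of class 1.
    """
    pi = 0.5
    mu = np.array([np.quantile(x, 0.25), np.quantile(x, 0.75)])
    sigma = np.array([x.std(), x.std()])
    q = np.array([0.3, 0.7])  # start by assuming the surrogate beats chance
    for _ in range(n_iter):
        # E-step: posterior class probabilities given x and s.
        prior = np.array([1.0 - pi, pi])
        lik = np.empty((2, x.size))
        for k in range(2):
            p_s = np.where(s == 1, q[k], 1.0 - q[k])
            lik[k] = prior[k] * normal_pdf(x, mu[k], sigma[k]) * p_s
        r = lik / lik.sum(axis=0)
        # M-step: weighted updates of every parameter.
        nk = r.sum(axis=1)
        pi = nk[1] / x.size
        mu = (r @ x) / nk
        var = (r * (x - mu[:, None]) ** 2).sum(axis=1) / nk
        sigma = np.sqrt(np.maximum(var, 1e-12))
        q = np.clip((r @ s) / nk, 1e-6, 1.0 - 1e-6)
    return pi, mu, sigma, q, r[1]
```

Calling `em_joint(*simulate(3000)[:2])` returns estimates of the class proportion, the class-specific means and standard deviations, and the surrogate's error rates together, which loosely mirrors the abstract's point that the latent class distributions and the surrogate classifiers are assessed jointly. Dropping the `p_s` factor reduces the sketch to an ordinary two-component Gaussian mixture that ignores the auxiliary information.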

Provider:
Taylor & Francis
Created:
2020-12-14