Learning Topic Models: Identifiability and Finite-Sample Analysis

Source: NIAID Data Ecosystem (indexed 2026-03-13)

Download link:

https://figshare.com/articles/dataset/Learning_Topic_Models_Identifiability_and_Finite-Sample_Analysis/20073585

Description:

Topic models provide a useful text-mining tool for learning, extracting, and discovering latent structures in large text corpora. Although a plethora of methods have been proposed for topic modeling, lacking in the literature is a formal theoretical investigation of the statistical identifiability and accuracy of latent topic estimation. In this article, we propose a maximum likelihood estimator (MLE) of latent topics based on a specific integrated likelihood that is naturally connected to the concept, in computational geometry, of volume minimization. Our theory introduces a new set of geometric conditions for topic model identifiability, conditions that are weaker than conventional separability conditions, which typically rely on the existence of pure topic documents or of anchor words. Weaker conditions allow a wider and thus potentially more fruitful investigation. We conduct finite-sample error analysis for the proposed estimator and discuss connections between our results and those of previous investigations. We conclude with empirical studies employing both simulated and real datasets. Supplementary materials for this article are available online.
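The two geometric notions named in the description can be made concrete with a small sketch. It is not the estimator from the article. It only assumes that each topic is a probability vector over the vocabulary, so that K topics form the vertices of a simplex, and it shows (a) how the volume of that simplex can be computed and (b) how the conventional anchor-word separability condition can be checked. The function names `simplex_volume` and `has_anchor_words` and the toy matrices are hypothetical, chosen for illustration.

```python
import math
import numpy as np


def simplex_volume(topics):
    """(K-1)-dimensional volume of the simplex whose vertices are the
    columns of `topics`, a V x K matrix (V words, K topics)."""
    edges = topics[:, 1:] - topics[:, [0]]  # edge vectors from the first vertex
    gram = edges.T @ edges                  # (K-1) x (K-1) Gram matrix
    det = max(float(np.linalg.det(gram)), 0.0)  # guard against tiny negative round-off
    return math.sqrt(det) / math.factorial(edges.shape[1])


def has_anchor_words(topics, tol=1e-12):
    """True if every topic has at least one word that has positive
    probability under that topic and zero probability under all others."""
    positive = topics > tol
    exclusive = positive & (positive.sum(axis=1, keepdims=True) == 1)
    return bool(exclusive.any(axis=0).all())


# Toy topic matrices (columns sum to 1); both are assumptions for illustration.
separable = np.array([
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5],
    [0.5, 0.5, 0.5],
])
non_separable = np.array([
    [0.4, 0.4, 0.0],
    [0.4, 0.0, 0.4],
    [0.0, 0.4, 0.4],
    [0.2, 0.2, 0.2],
])

print("anchor words (separable):    ", has_anchor_words(separable))
print("anchor words (non-separable):", has_anchor_words(non_separable))
print("volume (separable):          ", simplex_volume(separable))
print("volume (non-separable):      ", simplex_volume(non_separable))
```

The second matrix has no anchor words, yet its columns still span a non-degenerate simplex. That is the kind of setting in which a condition weaker than separability would be needed for identifiability.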

Created:

2022-06-15
