
Using SVD for Topic Modeling

Source: DataCite Commons. Updated: 2022-10-11. Indexed: 2024-07-29.
Download link:
https://tandf.figshare.com/articles/dataset/Using_SVD_for_Topic_Modeling/21084816
Description:
The probabilistic topic model imposes a low-rank structure on the expectation of the corpus matrix. Therefore, singular value decomposition (SVD) is a natural tool of dimension reduction. We propose an SVD-based method for estimating a topic model. Our method constructs an estimate of the topic matrix from only a few leading singular vectors of the data matrix, and has a great advantage in memory use and computational cost for large-scale corpora. The core ideas behind our method include a pre-SVD normalization to tackle severe word frequency heterogeneity, a post-SVD normalization to create a low-dimensional word embedding that manifests a simplex geometry, and a post-SVD procedure to construct an estimate of the topic matrix directly from the embedded word cloud. We provide the explicit rate of convergence of our method. We show that our method attains the optimal rate in the case of long and moderately long documents, and it improves the rates of existing methods in the case of short documents. The key of our analysis is a sharp row-wise large-deviation bound for empirical singular vectors, which is technically demanding to derive and potentially useful for other problems. We apply our method to a corpus of Associated Press news articles and a corpus of abstracts of statistical papers. Supplementary materials for this article are available online.
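
The pipeline described in the abstract (pre-SVD normalization, a few leading singular vectors, post-SVD normalization, then topic-matrix construction from the embedded word cloud) can be illustrated with a rough NumPy sketch. The abstract gives no formulas, so every concrete choice below is an assumption made for illustration: scaling each word by the inverse square root of its mean frequency, dividing the trailing singular vectors entrywise by the leading one, and locating simplex vertices with a successive projection step swapped in as a simple stand-in for whatever vertex-hunting procedure the authors use. The function names `svd_topic_sketch` and `successive_projection` are hypothetical, and this is not the authors' code.

```python
import numpy as np


def successive_projection(X, K):
    """Pick K rows of X that approximate simplex vertices.

    Stand-in vertex search (successive projection); an assumption,
    not necessarily the procedure used in the paper.
    """
    residual = X.astype(float).copy()
    chosen = []
    for _ in range(K):
        # The row with the largest residual norm is taken as a vertex.
        j = int(np.argmax((residual ** 2).sum(axis=1)))
        chosen.append(j)
        u = residual[j] / np.linalg.norm(residual[j])
        # Remove the component along the chosen direction from every row.
        residual = residual - np.outer(residual @ u, u)
    return chosen


def svd_topic_sketch(D, K):
    """Estimate a p x K topic matrix from a p x n word-frequency matrix D.

    Each column of D holds one document's word frequencies.
    All normalizations here are assumed forms for illustration.
    """
    # Pre-SVD normalization (assumed): damp word-frequency heterogeneity.
    m = np.maximum(D.mean(axis=1), 1e-12)
    D_norm = D / np.sqrt(m)[:, None]

    # Only the K leading left singular vectors are used.
    U, _, _ = np.linalg.svd(D_norm, full_matrices=False)
    xi = U[:, :K].copy()
    if xi[:, 0].sum() < 0:  # fix the sign of the leading vector
        xi[:, 0] *= -1
    lead = np.maximum(xi[:, 0], 1e-12)

    # Post-SVD normalization (assumed): entrywise ratios give a
    # (K-1)-dimensional embedding of each word.
    R = xi[:, 1:] / lead[:, None]

    # Vertex search on the embedded word cloud, using points augmented with 1.
    X = np.hstack([np.ones((R.shape[0], 1)), R])
    V = X[successive_projection(X, K)]

    # Barycentric coordinates of each word with respect to the vertices.
    Pi = np.linalg.solve(V.T, X.T).T
    Pi = np.clip(Pi, 0.0, None)
    Pi = Pi / Pi.sum(axis=1, keepdims=True)

    # Undo both normalizations, then make each topic a distribution over words.
    A_hat = np.sqrt(m)[:, None] * lead[:, None] * Pi
    return A_hat / A_hat.sum(axis=0, keepdims=True)
```

Calling `svd_topic_sketch(D, K)` on a word-by-document frequency matrix returns a nonnegative matrix whose columns each sum to one. The sketch says nothing about the convergence rates claimed in the abstract.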

Provider:
Taylor & Francis
Created:
2022-09-12