Twitter-LDA
Source: NIAID Data Ecosystem (indexed 2026-03-07)
Download link:
https://figshare.com/articles/dataset/Twitter-LDA/12062730
Description:
Latent Dirichlet Allocation (LDA) is widely used in text analysis. Standard LDA finds hidden "topics" in a collection of documents, where a topic is a subject such as "arts" or "education" that the documents discuss. In standard LDA every word carries its own topic label, a setting that may not work well on Twitter: tweets are short, and a single tweet is more likely to be about one topic. Twitter-LDA (T-LDA) was proposed to address this issue. T-LDA also addresses the noisy nature of tweets by capturing background words in tweets. Experiments in [7] showed that T-LDA captures more meaningful topics than LDA on microblogs.
Related Publication: Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E. P., Yan, H., & Li, X. (2011). Comparing twitter and traditional media using topic models. In Advances in Information Retrieval (pp. 338-349). http://doi.org/10.1007/978-3-642-20161-5_34
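The two ideas in the description (one topic per tweet, plus a separate pool of background words) can be illustrated with a minimal generative sketch. This is not the dataset's code or the authors' implementation. The function name `generate_tweet`, the parameter `p_topic_word`, and the toy vocabularies are hypothetical, and words are drawn uniformly from small lists rather than from learned word distributions, which is a simplifying assumption.

```python
import random


def generate_tweet(user_topic_weights, topic_words, background_words,
                   n_words, p_topic_word, rng):
    """Sample one toy tweet: a single topic for the tweet, mixed with background words.

    All names here are illustrative assumptions, not part of the dataset.
    """
    # Draw ONE topic for the whole tweet (standard LDA would draw one per word).
    topic = rng.choices(range(len(user_topic_weights)),
                        weights=user_topic_weights, k=1)[0]
    words = []
    for _ in range(n_words):
        # Each word is either a topic word or a background (noise) word.
        if rng.random() < p_topic_word:
            words.append(rng.choice(topic_words[topic]))
        else:
            words.append(rng.choice(background_words))
    return topic, words


if __name__ == "__main__":
    rng = random.Random(0)
    topic_words = [["museum", "painting", "gallery"],   # toy "arts" topic
                   ["school", "teacher", "exam"]]       # toy "education" topic
    background_words = ["lol", "today", "really"]
    topic, words = generate_tweet([0.7, 0.3], topic_words, background_words,
                                  n_words=6, p_topic_word=0.6, rng=rng)
    print(topic, words)
```

Fitting such a model to real tweets means inferring the topic of each tweet and the topic-or-background switch of each word from the observed text, which this sketch does not attempt.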
Created:
2011-04-01



