Data for: An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit
Source: doi.org (indexed 2025-03-25)
Download link:
http://doi.org/10.17632/85njyhj45m.1
Description:
Topic-labelled online social network (OSN) data sets are useful for evaluating topic modelling and document clustering methods. We provide three data sets with topic labels from two online social networks: Twitter and Reddit. To comply with Twitter’s terms and conditions, we publish only the tweet identifiers along with the topic labels. The Reddit data is supplied with the full text and the topic label. The first Twitter data set was collected from the Twitter API by filtering for the hashtag #Auspol, which is used to tag tweets discussing Australian politics. The second Twitter data set was originally used in the RepLab 2013 competition and contains expert-annotated topics. The Reddit data set consists of 40,000 Reddit parent comments from May 2015 belonging to 5 subreddit pages, which are used as topic labels.
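As an illustration of how topic labels can be used to score a clustering, the following minimal sketch computes normalized mutual information (NMI) between gold topic labels and predicted cluster ids. It is not taken from the data set or its accompanying paper: the function name, the toy label values, and the choice of NMI with arithmetic-mean normalization are all assumptions made for the example, and the file layout of the published data is not described here, so the inputs are plain in-memory lists.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels, clusters):
    """NMI between gold topic labels and predicted cluster ids.

    Hypothetical helper for illustration; normalizes mutual information
    by the arithmetic mean of the two entropies.
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")
    n = len(labels)
    label_counts = Counter(labels)
    cluster_counts = Counter(clusters)
    joint_counts = Counter(zip(labels, clusters))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical distribution.
        return -sum((c / n) * log(c / n) for c in counts.values())

    # Mutual information summed over observed (label, cluster) pairs.
    mutual_info = sum(
        (c / n) * log(c * n / (label_counts[lab] * cluster_counts[clu]))
        for (lab, clu), c in joint_counts.items()
    )
    h_labels = entropy(label_counts)
    h_clusters = entropy(cluster_counts)
    if h_labels == 0 and h_clusters == 0:
        # Both partitions are a single group, so they agree trivially.
        return 1.0
    return mutual_info / ((h_labels + h_clusters) / 2)


# Toy example: hypothetical topic labels, not values from the data set.
gold = ["topic_a", "topic_a", "topic_b", "topic_b"]
perfect = [0, 0, 1, 1]      # clusters match the topics exactly
unrelated = [0, 1, 0, 1]    # clusters are independent of the topics

score_perfect = normalized_mutual_information(gold, perfect)
score_unrelated = normalized_mutual_information(gold, unrelated)
```

With real data, `gold` would hold the topic label of each document (for the Reddit set, the subreddit) and the second argument would hold the cluster id a model assigned to the same document. A score near 1 indicates clusters that align with the topic labels, and a score near 0 indicates no relationship.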
Provider:
doi.org



