Twitter cascade dataset
Source: DataCite Commons | Updated: 2021-03-12 | Indexed: 2024-07-13
Download link:
https://researchdata.smu.edu.sg/articles/dataset/Twitter_cascade_dataset/12062709
Description:
This dataset comprises a set of information cascades generated by Twitter users in Singapore. Here, a cascade is defined as a set of tweets about the same topic.

The dataset was collected via the Twitter REST and streaming APIs as follows. Starting from popular seed users (i.e., users with many followers), we crawled their follow, retweet, and user-mention links. We then added those followers/followees, retweet sources, and mentioned users who state Singapore in their profile location, giving a total of 184,794 Twitter user accounts. Tweets were then crawled from these users from 1 April to 31 August 2012, yielding 32,479,134 tweets in all.

To identify cascades, we extracted all URL links and hashtags from these tweets and treated each one as the identity of a cascade. In other words, all tweets that contain the same URL link (or the same hashtag) form a cascade. Mathematically, a cascade is represented as a set of user-timestamp pairs. Figure 1 provides an example: cascade C = {<u1, t1>, <u2, t2>, <u1, t3>, <u3, t4>, <u4, t5>}.

For evaluation, the dataset was split into two parts: the first four months of data for training and the last month of data for testing. Table 1 summarizes the basic (count) statistics of the dataset. Each line in each file represents a cascade: the first term is a hashtag or URL, and the second term is a list of user-timestamp pairs. Due to privacy concerns, all user identities are anonymized.
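The cascade definition above (tweets sharing a URL or hashtag, stored as user-timestamp pairs) can be illustrated with a minimal Python sketch. It is not the authors' extraction code: the tweet record shape `(user, timestamp, text)`, the two regular expressions, and the function name `extract_cascades` are all assumptions made for illustration, and the published files already contain the finished cascades rather than raw tweets.

```python
import re
from collections import defaultdict

# Assumed patterns; the description does not say how URLs and hashtags were matched.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def extract_cascades(tweets):
    """Group hypothetical (user, timestamp, text) tweets into cascades.

    Returns a dict mapping each URL or hashtag to a chronologically
    ordered list of (user, timestamp) pairs.
    """
    cascades = defaultdict(list)
    for user, timestamp, text in tweets:
        urls = URL_RE.findall(text)
        # Remove URLs first so a "#fragment" inside a URL is not read as a hashtag.
        tags = HASHTAG_RE.findall(URL_RE.sub(" ", text))
        # A tweet joins one cascade per distinct URL or hashtag it contains.
        for key in set(urls) | set(tags):
            cascades[key].append((user, timestamp))
    return {key: sorted(pairs, key=lambda p: p[1]) for key, pairs in cascades.items()}


# Toy input: the same user may appear more than once in a cascade.
tweets = [
    ("u1", 1, "loving #sgfood"),
    ("u2", 2, "try this http://example.com/a #sgfood"),
    ("u1", 3, "again #sgfood"),
]
cascades = extract_cascades(tweets)
```

In this toy input, user `u1` appears twice in the `#sgfood` cascade, mirroring the repeated `u1` in the example cascade C.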
Provider:
SMU Research Data Repository (RDR)
Created:
2020-04-02
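Dataset introduction

Background overview
This dataset contains information cascade data generated by Singapore Twitter users in 2012. Topic-related sets of tweets are identified by URL and hashtag; in total, more than 32 million tweets from about 180,000 users were collected, with cascades represented as user-timestamp pairs. The dataset is split into training and testing parts, and all user identities are anonymized.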
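A temporal train/test split of one cascade might look like the sketch below. It rests on several assumptions that the description does not confirm: timestamps are Unix seconds, the boundary is 1 August 2012 at 00:00 UTC (four months of training data, one month of testing data), and the split is applied to individual user-timestamp pairs. The names `SPLIT_TS` and `split_cascade` are hypothetical.

```python
from datetime import datetime, timezone

# Assumed boundary between the training months (April to July 2012) and
# the testing month (August 2012); the time zone is a guess.
SPLIT_TS = datetime(2012, 8, 1, tzinfo=timezone.utc).timestamp()


def split_cascade(pairs, boundary=SPLIT_TS):
    """Split (user, timestamp) pairs into training and testing portions."""
    train = [(user, ts) for user, ts in pairs if ts < boundary]
    test = [(user, ts) for user, ts in pairs if ts >= boundary]
    return train, test
```

The published dataset already ships with its own split, so a helper like this is only relevant when re-deriving a comparable split from timestamps.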



