Good Morning Tweets

Name: Good Morning Tweets
Creator: Kaggle
Published: 2016-12-09 00:00:00
License: No description available

www.kaggle.com, updated 2016-12-09, indexed 2025-01-16

Download link:

https://www.kaggle.com/tentotheminus9/good-morning-tweets


Description:

# Context

It's possible, using R (and no doubt Python), to 'listen' to Twitter and capture tweets that match a certain description. I decided to test this out by grabbing tweets with the text 'good morning' in them over a 24-hour period, to see if you could see the world waking up from the location information and time-stamp. The main R package used was [streamR][1].

# Content

The tweets have been tidied up quite a bit. First, I've removed re-tweets. Second, I've removed duplicates (not sure why Twitter gave me them in the first place). Third, I've made sure each tweet contained the words 'good morning' (some tweets were returned that didn't have the text in them for some reason). Fourth, I've removed all the tweets that didn't have a longitude and latitude included. This last step removed the vast majority.

What's left are various aspects of just under 5000 tweets. The columns are:

- text
- retweet_count
- favorited
- truncated
- id_str
- in_reply_to_screen_name
- source
- retweeted
- created_at
- in_reply_to_status_id_str
- in_reply_to_user_id_str
- lang
- listed_count
- verified
- location
- user_id_str
- description
- geo_enabled
- user_created_at
- statuses_count
- followers_count
- favourites_count
- protected
- user_url
- name
- time_zone
- user_lang
- utc_offset
- friends_count
- screen_name
- country_code
- country
- place_type
- full_name
- place_name
- place_id
- place_lat
- place_lon
- lat
- lon
- expanded_url
- url

# Acknowledgements

I used a few blog posts to get the code up and running, including [this one][2].

# Code

The R code I used to get the tweets is as follows. Note that I haven't included the code to set up the connection to Twitter; see the streamR PDF and the link above for that. You need a Twitter account.

    library(streamR)

    # my_oauth comes from the connection set-up code, which is not shown here
    tweets.all = NULL  # Collection of kept tweets; must exist before the first rbind()

    i = 1
    while (i <= 280) {
      # Listen for 300 seconds per round
      filterStream("tw_gm.json", timeout = 300, oauth = my_oauth, track = 'good morning', language = 'en')
      tweets_gm = parseTweets("tw_gm.json")

      ex = grepl('RT', tweets_gm$text, ignore.case = FALSE) #Remove the RTs
      tweets_gm = tweets_gm[!ex,]

      ex = grepl('good morning', tweets_gm$text, ignore.case = TRUE) #Remove anything without good morning in the main tweet text
      tweets_gm = tweets_gm[ex,]

      ex = is.na(tweets_gm$place_lat) #Remove any with missing place_latitude information
      tweets_gm = tweets_gm[!ex,]

      tweets.all = rbind(tweets.all, tweets_gm) #Add to the collection

      i = i + 1
      Sys.sleep(5)
    }

[1]: https://cran.r-project.org/web/packages/streamR/streamR.pdf
[2]: http://politicaldatascience.blogspot.co.uk/2015/12/rtutorial-using-r-to-harvest-twitter.html
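The four tidy-up steps and the 'waking up' idea can be sketched in base R on a small made-up data frame. This is only an illustration under stated assumptions, not the code used to build the dataset: `clean_tweets()` and `local_hour()` are hypothetical helper names, the toy rows are invented, `created_at` is assumed to be in the classic Twitter format (`"Fri Dec 09 07:15:00 +0000 2016"`), and `utc_offset` is assumed to be the user's offset from UTC in seconds. Retweets are matched with the anchored pattern `^RT @` instead of a bare `'RT'`, because a bare, case-sensitive `'RT'` also matches text such as "SMART".

```r
# Toy stand-in for a parsed tweet table; every value here is invented.
tweets <- data.frame(
  id_str = c("1", "2", "3", "3", "4", "5", "6"),
  text = c("Good morning world!",
           "RT @someone: good morning",
           "GOOD MORNING, SMART PEOPLE",
           "GOOD MORNING, SMART PEOPLE",
           "Start your day right",
           "good morning from somewhere",
           "Good morning Tokyo"),
  created_at = c("Fri Dec 09 07:15:00 +0000 2016",
                 "Fri Dec 09 07:20:00 +0000 2016",
                 "Fri Dec 09 12:30:00 +0000 2016",
                 "Fri Dec 09 12:30:00 +0000 2016",
                 "Fri Dec 09 09:00:00 +0000 2016",
                 "Fri Dec 09 10:00:00 +0000 2016",
                 "Thu Dec 08 22:05:00 +0000 2016"),
  utc_offset = c(0, 0, -18000, -18000, 0, 3600, 32400),
  place_lat = c(51.5, 51.5, 40.7, 40.7, 48.9, NA, 35.7),
  place_lon = c(-0.1, -0.1, -74.0, -74.0, 2.4, NA, 139.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply the four tidy-up steps in order.
clean_tweets <- function(df) {
  df <- df[!grepl("^RT @", df$text), ]                           # 1. drop retweets
  df <- df[!duplicated(df$id_str), ]                             # 2. drop duplicate tweets
  df <- df[grepl("good morning", df$text, ignore.case = TRUE), ] # 3. require the phrase
  df <- df[!is.na(df$place_lat) & !is.na(df$place_lon), ]        # 4. require coordinates
  df
}

# Hypothetical helper: local time of day in hours (0 to 24).
# Assumes fixed-width created_at (hour at characters 12-13, minute at 15-16)
# and utc_offset in seconds.
local_hour <- function(created_at, utc_offset) {
  utc_hour <- as.integer(substr(created_at, 12, 13))
  utc_min  <- as.integer(substr(created_at, 15, 16))
  (utc_hour + utc_min / 60 + utc_offset / 3600) %% 24
}

cleaned <- clean_tweets(tweets)
hours <- floor(local_hour(cleaned$created_at, cleaned$utc_offset))
print(table(hours))  # count of kept tweets per local hour
```

Note that `utc_offset` describes the account's profile setting rather than the place the tweet was sent from, so the local hour is an approximation; deriving a time zone from `place_lat` and `place_lon` would be a more faithful but heavier alternative.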


Provider:

Kaggle
