100 Days of Tweet IDs and Most Frequent Terms in Tweets from_user_id_str 25073877

Name: 100 Days of Tweet IDs and Most Frequent Terms in Tweets from_user_id_str 25073877
Creator: city.figshare.com
Published: 2017-04-29 00:00:00
License: No description provided

city.figshare.com | Updated: 2017-04-29 | Indexed: 2025-03-25

Download link:

https://city.figshare.com/articles/dataset/100_Days_of_Tweet_IDs_and_Most_Frequent_Terms_in_Tweets_from_user_id_str_25073877/4955231/1

Description:

This is an Excel workbook containing two sheets. The first sheet contains 503 rows corresponding to 503 Tweet id strings from_user_id_str 25073877 and the following corresponding metadata: created_at, time, user_lang, in_reply_to_user_id_str, f, from_user_id_str, in_reply_to_status_id_str, source, user_followers_count, user_friends_count. Tweet texts, URLs and other metadata such as profile_image_url, status_url and entities_str have not been included. An attempt to remove duplicated entries was made, but duplicates might have remained, so further data refining might be required prior to analyses.

The second sheet contains 400 rows corresponding to the most frequent terms in the dataset's Tweets' texts. The text analysis was performed with the Terms Tool from Voyant Tools by Stéfan Sinclair & Geoffrey Rockwell (2017). An edited English stop words list was applied to remove Twitter data specific terms such as t.co, https, user names, etc. The analysed Tweets contained emojis and other special characters; due to character encoding these will be reflected in the terms list as character combinations.

Motivations to Share this Data

Archived Tweets can provide interesting insights for the study of contemporary history of media, politics, diplomacy, etc. The queried account is a public account widely agreed to be of exceptional national and international public interest. Though they provide public access to tweeted content in real time, Twitter Web and mobile clients are not suited for appropriate Tweet corpus analysis. For anyone researching social media, access to the data is absolutely essential in order to perform, review and reproduce studies. Archiving Tweets of public interest due to their historic significance is a means to both preserve and enable reproducible study of this form of rapid online communication that otherwise can very likely become unretrievable as time passes.

Due to Twitter's current business model and API limits, to date collecting in real time is the only relatively reliable method to archive Tweets at a small scale. So far, Twitter data analysis and visualisation has been done without researchers providing access to the source data that would allow reproducibility. It is appreciated that an Excel workbook is far from ideal as a file format, but due to the small scale the intention is to make this data human readable and available to researchers in a variety of non-technical fields.

Methodology and Limitations

The Tweets contained in this file were collected by Ernesto Priego using a Python script. The data collection search query was from:realdonaldtrump. A trigger was scheduled to collect automatically every hour; this means that any Tweets deleted immediately after publication have not been collected. The original data harvesting was refined to delete duplications, to comply with Twitter's Terms and Conditions, and so that the data was sorted in chronological order.

Duplication of data due to the automated collection is possible, so further data refining might be required. The file may not contain data from Tweets deleted by the queried user account immediately after original publication. Both research and experience show that the Twitter search API is not 100% reliable (Gonzalez-Bailon, Sandra, et al. 2012).

Apart from the filters and limitations already declared, it cannot be guaranteed that this file contains each and every Tweet posted by the queried account during the indicated period. This dataset is shared for archival, comparative and indicative educational research purposes only. The content included is from a public Twitter account and was obtained from the Twitter Search API. The shared data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.

The original Tweets, their contents and associated metadata were published openly on the Web from the queried public account and are the responsibility of the original authors. Original Tweets are likely to be copyrighted by their individual authors, but please check individually. The license on this output applies to the data collection; third-party content should be attributed to the original authors and copyright owners. Please note that usernames, user profile pictures and full text of the Tweets collected have not been included in this file. No private personal information is shared in this dataset. As indicated above, this dataset does not contain the text of the Tweets. The collection and sharing of this dataset is enabled and allowed by Twitter's Privacy Policy. The sharing of this dataset complies with Twitter's Developer Rules of the Road.

This dataset is shared to archive, document and encourage open educational research into political activity on Twitter.

Other Considerations

All Twitter users agree to Twitter's Privacy and data sharing policies. Social media research remains in its infancy, and though work has been done to develop best practices there is as yet no agreement on a series of grey areas relating to research methodologies, including ad hoc social media specific research ethics guidelines for reproducible research. It is understood that public figures Tweet publicly with the conscious intention to have their Tweets publicly accessed and discussed.

It is assumed that a public figure Tweeting publicly is of public interest and that such a figure, as a Twitter user, has given implicit consent, by agreeing explicitly to Twitter's Terms and Conditions, for their Tweets to be publicly accessed and discussed, including critical analysis, without the need for prior written permission. There is therefore no difference between collecting data and performing data analysis from a public printed or online publication and collecting data and performing data analysis of a dataset containing Twitter data from a public account from a public user in a public role. Though these datasets have limitations and are not thoroughly systematic, it is hoped they can contribute to developing new insights into the discipline's presence on Twitter over time. Reproducibility is considered here a key value for robust and trustworthy research. Different scholarly professional associations like the Modern Language Association recognise Tweets, datasets and other online and digital resources as citable scholarly outputs.

The data contained in the deposited file is otherwise available elsewhere through different methods.
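
Since the description warns that duplicates might remain and that further refining might be required before analysis, the following is a minimal sketch of one way to do that refining in Python. It is not the script used for the original collection. It assumes the first sheet has been exported to rows of column/value pairs, that the Tweet ID column is named id_str (a hypothetical name; the description does not state it), and that created_at uses Twitter's classic timestamp format. The sample rows are made up for illustration and are not values from the dataset.

```python
from datetime import datetime

# Assumption: created_at follows Twitter's classic API format,
# e.g. "Sat Apr 29 12:00:00 +0000 2017". Adjust if the sheet differs.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def dedupe_and_sort(rows, id_column="id_str", time_column="created_at"):
    """Keep the first row seen for each Tweet ID, then sort oldest first.

    id_column is a hypothetical column name; IDs are compared as strings
    so that long numeric IDs are never rounded.
    """
    seen = set()
    unique = []
    for row in rows:
        tweet_id = str(row[id_column]).strip()
        if tweet_id in seen:
            continue  # duplicate from an overlapping hourly collection
        seen.add(tweet_id)
        unique.append(row)
    return sorted(
        unique,
        key=lambda r: datetime.strptime(r[time_column], CREATED_AT_FORMAT),
    )


# Made-up rows for illustration only.
sample_rows = [
    {"id_str": "1002", "created_at": "Sat Apr 29 12:00:00 +0000 2017"},
    {"id_str": "1001", "created_at": "Fri Apr 28 09:30:00 +0000 2017"},
    {"id_str": "1002", "created_at": "Sat Apr 29 12:00:00 +0000 2017"},
]
cleaned = dedupe_and_sort(sample_rows)
print([row["id_str"] for row in cleaned])  # ['1001', '1002']
```

If the sheet is saved as CSV, rows of this shape can be produced with csv.DictReader from the standard library. Keeping IDs as strings matters because spreadsheet software can round very long numeric IDs when it treats them as numbers.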

Provider:

city.figshare.com