
Harvard CGA Geotweet Archive v2.0

DataCite Commons. Updated 2025-03-11. Indexed 2025-04-15.
Download link:
https://dataverse.harvard.edu/citation?persistentId=doi:10.7910/DVN/3NCMB6
Description:
<p><b>Geotweet Archive v2.0</b> The Harvard Center for Geographic Analysis (CGA) maintains the Geotweet Archive, a global record of tweets spanning time, geography, and language. The primary purpose of the Archive is to make a comprehensive collection of geo-located tweets available to the academic research community.</p>
<p>The Archive extends from 2010 to July 12, 2023, when Twitter stopped allowing free access to its API and moved API access to a paid model. The collection totals approximately 10 billion tweets and is stored on <a href="https://www.rc.fas.harvard.edu/about/cluster-architecture/">Harvard University’s High Performance Computing (HPC) cluster</a>. The Harvard HPC supports many applications for working with big spatio-temporal datasets, including two geospatial tools recently deployed by the CGA: Heavy.ai and PostGIS.</p>
<p>The Geotweet Archive consists of tweets which carry two types of geospatial signature:</p>
<p>1) GPS-based longitude/latitude generated by the originating device.</p>
<p>2) Place-name-centroid-based longitude/latitude from the bounding box provided by Twitter, based on the user-defined place designation (typically a town name).</p>
<p>Any tweet which carries one or both of these signatures is included in the Archive. Approximately 1-2% of all tweets contain such geographic coordinates (this percentage needs verification and may vary over time).</p>
<p>The current version of the Archive is Version 2.0. The original Version 1.0 archive began in 2012 as part of a project started by Ben Lewis of CGA and then Harvard graduate student Todd Mostak to develop a GPU-powered spatial database called GEOPS. GEOPS formed the basis for the technology startup MapD Technologies, which then became OmniSci and is now known as Heavy.ai. Heavy.ai Immerse software now runs on Harvard’s High Performance Computing (HPC) environment to support interactive exploration and analytics with the Geotweet Archive and any other large datasets.</p>
<p>Version 2.0 of the archive is the result of a merge between the CGA archive, an archive developed by the Department of Geoinformatics at the University of Salzburg in Austria, and several other archives, led by Ben Lewis of Harvard CGA. Clemens Havas and Bernd Resch at the University of Salzburg worked with Devika Kakkar of Harvard CGA to deploy Version 2.0.</p>
<hr>
<p>Schema of Geotweet Archive v2.0</p>
<p><b>Field name----TYPE----Description</b></p>
<p><b>message_id</b>----BIGINT----Tweet ID</p>
<p><b>tweet_date</b>----TIMESTAMP----Date and time of tweet from Twitter (UTC)</p>
<p><b>tweet_text</b>----TEXT ENCODING----Text content of tweet</p>
<p><b>tags</b>----TEXT ENCODING DICT----Tweet hashtags</p>
<p><b>tweet_lang</b>----TEXT ENCODING DICT----Language that the tweet is in</p>
<p><b>source</b>----TEXT ENCODING DICT----Operating system or application type used to create the tweet</p>
<p><b>place*</b>----TEXT ENCODING NONE----The geographic place as defined by the user, usually a town name. Twitter determines a bounding box from this field; the centroid (see the longitude and latitude fields) and the spatial_error value are derived from that box and are used when no GPS coordinate overrides them. See <a href="https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object/">Twitter tweet object for place</a>.</p>
<p><b>retweets</b>----SMALLINT----Number of retweets as of the last time it was checked</p>
<p><b>tweet_favorites</b>----SMALLINT----Now known as ‘likes’</p>
<p><b>photo_url</b>----TEXT ENCODING DICT----URL of any image referenced</p>
<p><b>quoted_status_id</b>----BIGINT----ID number for quote status</p>
<p><b>user_id</b>----BIGINT----User ID number</p>
<p><b>user_name</b>----TEXT ENCODING NONE----User name</p>
<p><b>user_location*</b>----TEXT ENCODING NONE----User-defined location, usually a city or town. See <a href="https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object/">Twitter user object</a>.</p>
<p><b>followers</b>----SMALLINT----Followers as of the last time checked</p>
<p><b>friends</b>----SMALLINT----Number of users followed by this user</p>
<p><b>user_favorites</b>----INT----Number of topics the user is interested in</p>
<p><b>status</b>----INT----Code for what user is doing as of the last time it was checked</p>
<p><b>user_lang</b>----TEXT ENCODING DICT----User-defined language</p>
<p><b>latitude</b>----FLOAT----Latitude from GPS or bounding box based on Place field</p>
<p><b>longitude</b>----FLOAT----Longitude from GPS or bounding box based on Place field</p>
<p><b>data_source*</b>----TEXT ENCODING DICT----The source crawler or dataset for the tweet</p>
<p><b>gps</b>----TEXT ENCODING DICT----Flag for whether lon/lat is from GPS or town name bounding box (SRID 4326). When both are present, the GPS coordinate takes priority.</p>
<p><b>spatialerror</b>----FLOAT----Estimated horizontal error of the lon/lat coordinate, in meters: 10 m for GPS coordinates; for bounding boxes, the radius of a circle with the same area as the bounding box.</p>
<hr>
<p><b>*data_source----Code</b></p>
<p>U. Salzburg REST API crawler----1</p>
<p>Harvard CGA streaming crawler----2</p>
<p>U. Salzburg streaming API crawler----3</p>
<p>Ryan Qi Wang and Harvard Medical School datasets----4</p>
<p>U. Heidelberg dataset----5</p>
<p><a href="https://archive.org/details/twitterstream/">Archive.org dataset</a>----6</p>
<hr>
<p>Note: Before April 2015, GPS coordinate capture was turned on by default for Twitter users. After this date, users have had to opt in to share their precise location. This is one reason for the large decrease in the volume of geotweets after this date.</p>
<p>A number of automated tweet-bots have been discovered which generate tweets with (apparently) randomly spoofed coordinates. These bots appear to make up no more than a few percent of harvested geotweets. Tweets from these senders are scattered randomly across the globe, so they were not generated by a human with a mobile device; see our current <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2F7OTPCI">list of tweet bots</a> for the sender names discovered so far.</p>
<p>If you are interested in accessing the archive, please fill out the <a href="https://gis.harvard.edu/contactus">CGA contact form</a>.</p>
<p>Before requesting or receiving Tweet IDs, requestors must agree to <a href="https://twitter.com/en/tos">Twitter's Terms of Service</a>, <a href="https://twitter.com/en/privacy">Twitter's Privacy Policy</a>, <a href="https://developer.twitter.com/en/developer-terms/agreement">Developer Agreement</a> and <a href="https://developer.twitter.com/en/developer-terms/policy">Developer Policy</a>.</p>
<p>Tweet data provided by CGA may only be used for not-for-profit research and for academic purposes. Recipients may not share CGA-provided tweet IDs, tweets, or content derived from them without written permission from the CGA.</p>
<p>CITATIONS: If you use the Geotweet Archive in your research, please reference it: "Harvard Center for Geographic Analysis Geotweet Archive (https://doi.org/10.7910/DVN/3NCMB6)."</p>
<p>For examples of geospatial systems capable of querying, visualizing, and analyzing millions or even billions of objects, please see the Heavy.ai Enterprise and PostGIS database platforms, which are running on the <a href="https://www.rc.fas.harvard.edu/">Harvard FAS Research Cluster</a>.</p>
<p><b>Note for Non-Harvard Researchers</b>: We are unable to share the full raw tweets with anyone outside of Harvard as per <a href="https://developer.twitter.com/en/developer-terms/agreement-and-policy">Twitter's content redistribution policy</a>. We have not harvested tweets since July 12, 2023, due to the change in Twitter's policy.</p>
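The spatialerror rule in the schema (10 m for GPS coordinates, otherwise the radius of a circle whose area equals the bounding box area) can be illustrated with a short Python sketch. The schema does not say how the Archive measured bounding box area, so the spherical-Earth area formula, the Earth radius constant, and all function names below are assumptions made for illustration only, not the Archive's actual code.

```python
import math

# Assumption: spherical Earth with the mean radius below. The Archive's
# own area calculation is not documented in the schema.
EARTH_RADIUS_M = 6_371_008.8
# Value stated in the schema for GPS-derived coordinates.
GPS_ERROR_M = 10.0


def bbox_area_m2(west, south, east, north):
    """Area in square meters of a lon/lat box on a sphere (degrees in).

    Does not handle boxes that cross the antimeridian.
    """
    dlon = math.radians(east - west)
    band = math.sin(math.radians(north)) - math.sin(math.radians(south))
    return EARTH_RADIUS_M ** 2 * abs(dlon) * abs(band)


def bbox_centroid(west, south, east, north):
    """Midpoint of the box as (longitude, latitude)."""
    return ((west + east) / 2.0, (south + north) / 2.0)


def spatial_error_m(has_gps, bbox=None):
    """Horizontal error estimate following the schema's description.

    GPS coordinates take priority; otherwise the error is the radius
    of a circle with the same area as the place bounding box.
    """
    if has_gps:
        return GPS_ERROR_M
    if bbox is None:
        raise ValueError("a bounding box is required when there is no GPS fix")
    return math.sqrt(bbox_area_m2(*bbox) / math.pi)


if __name__ == "__main__":
    box = (0.0, 0.0, 1.0, 1.0)  # hypothetical 1 x 1 degree box at the equator
    print(bbox_centroid(*box))
    print(round(spatial_error_m(False, box)))
    print(spatial_error_m(True))
```

For a town-sized bounding box the resulting radius is typically a few kilometers, which is why filtering on the gps flag or on spatialerror is a common first step before fine-grained spatial analysis.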
Provider:
Harvard Dataverse
Created:
2016-07-07
Collected summary
Dataset introduction
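Background and challenges
Background overview
Harvard CGA Geotweet Archive v2.0 is a global dataset of geotagged tweets for academic research. It covers 2010 to 2023, contains approximately 10 billion tweets, and supports a range of geospatial analysis tools. The dataset records each tweet's geographic information and a variety of metadata in detail, but its use is restricted by Twitter's policies to not-for-profit academic research.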
The summary above was collected and generated by 遇见数据集.