
Harvard CGA Geotweet Archive v2.0

DataONE, updated 2023-08-28, indexed 2024-06-08
Download link:
https://search.dataone.org/view/sha256:f63b9bff0902a76f3e19acbc6f45157d74636dcd82eded59c3c20fd8de2c29b2
Resource description:
The Harvard Center for Geographic Analysis (CGA) maintains the Geotweet Archive, a global record of tweets spanning time, geography, and language. The primary purpose of the Archive is to make a comprehensive collection of geo-located tweets available to the academic research community. The Archive extends from 2010 to the present and is updated daily. The collection totals approximately 10 billion tweets and is stored on Harvard University's High Performance Computing (HPC) cluster. The Harvard HPC supports many applications for working with big spatio-temporal datasets, including two geospatial tools recently deployed by the CGA: OmniSci Immerse and PostGIS.

The Geotweet Archive consists of tweets that carry two types of geospatial signature:
1) GPS-based longitude/latitude generated by the originating device
2) Place-name-centroid-based longitude/latitude from the bounding box provided by Twitter, based on the user-defined place designation (typically a town name)

Any tweet that carries one or both of these signatures is included in the Archive. Approximately 1-2% of all tweets contain such geographic coordinates (this percentage needs verification and may vary over time).

The current version of the Archive is Version 2.0. The original Version 1.0 archive began in 2012 as part of a project with Ben Lewis of CGA and then-Harvard graduate student Todd Mostak to develop a GPU-powered spatial database called GEOPS. GEOPS formed the basis for the technology startup MapD Technologies, which is now OmniSci. OmniSci Immerse software now runs on Harvard's HPC environment to support interactive exploration and analytics with the Geotweet Archive and any other large datasets.

Version 2.0 of the archive is the result of a merge between the CGA archive, an archive developed by the Department of Geoinformatics at the University of Salzburg in Austria, and several other archives.
Clemens Havas and Bernd Resch at the University of Salzburg and Devika Kakkar of Harvard CGA collaborated to deploy Version 2.0.

========================================================
Schema of Geotweet Archive v2.0

Field name----TYPE----Description
message_id----BIGINT----Tweet ID
tweet_date----TIMESTAMP----Date and time of tweet from Twitter (UTC)
tweet_text----TEXT ENCODING----Text content of tweet
tags----TEXT ENCODING DICT----Tweet hashtags
tweet_lang----TEXT ENCODING DICT----Language that the tweet is in
source----TEXT ENCODING DICT----Operating system or application type used to create the tweet
place*----TEXT ENCODING NONE----The geographic place as defined by the user, usually a town name. Twitter determines a bounding box from this field; the centroid (see the longitude and latitude fields) and the spatial_error field are derived from that box and are used when not overridden by a GPS coordinate. See Twitter tweet object for place.
retweets----SMALLINT----Number of retweets as of the last time it was checked
tweet_favorites----SMALLINT----Now known as 'likes'
photo_url----TEXT ENCODING DICT----URL of any image referenced
quoted_status_id----BIGINT----ID number for quote status
user_id----BIGINT----User ID number
user_name----TEXT ENCODING NONE----User name
user_location*----TEXT ENCODING NONE----User-defined location, usually a city or town. See Twitter user object.
followers----SMALLINT----Number of followers as of the last time it was checked
friends----SMALLINT----Number of users followed by this user
user_favorites----INT----Number of topics the user is interested in
status----INT----Code for what the user is doing as of the last time it was checked
user_lang----TEXT ENCODING DICT----User-defined language
latitude----FLOAT----Latitude from GPS or from the bounding box based on the place field
longitude----FLOAT----Longitude from GPS or from the bounding box based on the place field
data_source*----TEXT ENCODING DICT----The source crawler or dataset for the tweet
gps----TEXT ENCODING DICT----Flag for whether lon/lat is from GPS or from the town-name bounding box (SRID – 4326). When both are present, the GPS coordinate takes priority.
spatialerror----FLOAT----Estimated horizontal error, in meters, of the lon/lat coordinate: 10 m for GPS coordinates; for bounding boxes, the radius of a circle with the same area as the bounding box.

=====================================================
*data_source----Code
U. Salzburg REST API crawler----1
Harvard CGA streaming crawler----2
U. Salzburg streaming API crawler----3
Ryan Qi Wang and Harvard Medical School datasets----4
U. Heidelberg dataset----5
Archive.org dataset----6
----------------------------------------------------------------------------------------------

Note: Before April 2015, GPS coordinate capture was turned on by default for Twitter users. Since then, users have had to opt in to share their precise location. This is one reason for the large decrease in the volume of geotweets after that date. A number of automated...
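
The schema's rules for latitude, longitude, gps, and spatialerror can be illustrated with a minimal Python sketch. It is not the Archive's actual processing code. The function names, the tuple layouts, the midpoint centroid, and the spherical-Earth area formula are all assumptions made for illustration; only the 10 m GPS error, the GPS-over-bounding-box priority, and the "radius of a circle with the area of the bounding box" rule come from the description above.

```python
import math
from typing import Optional, Tuple

GPS_ERROR_M = 10.0            # stated in the schema for GPS coordinates
EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius (assumption, not from the source)

# Hypothetical layouts: gps = (lon, lat); bbox = (west, south, east, north), in degrees.
LonLat = Tuple[float, float]
BBox = Tuple[float, float, float, float]


def bbox_area_m2(west: float, south: float, east: float, north: float) -> float:
    """Approximate area of a lon/lat box on a sphere (no antimeridian handling)."""
    d_lon = math.radians(east - west)
    band = math.sin(math.radians(north)) - math.sin(math.radians(south))
    return EARTH_RADIUS_M ** 2 * d_lon * band


def bbox_spatial_error_m(bbox: BBox) -> float:
    """Radius of the circle whose area equals the bounding-box area."""
    return math.sqrt(bbox_area_m2(*bbox) / math.pi)


def resolve_location(
    gps: Optional[LonLat], bbox: Optional[BBox]
) -> Optional[Tuple[float, float, float, bool]]:
    """Return (lon, lat, spatial_error_m, from_gps), or None if no signature exists."""
    if gps is not None:
        # GPS takes priority when both signatures are present.
        lon, lat = gps
        return lon, lat, GPS_ERROR_M, True
    if bbox is not None:
        west, south, east, north = bbox
        # Midpoint of the box is used as the centroid (an assumption).
        return (west + east) / 2, (south + north) / 2, bbox_spatial_error_m(bbox), False
    return None
```

For example, `resolve_location(None, (-71.2, 42.3, -71.0, 42.4))` falls back to the box midpoint and reports an error radius of several kilometers, while passing a GPS pair alongside the same box returns the GPS point with the fixed 10 m error.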
Created:
2023-11-21