Twitter Dataset

Source: NIAID Data Ecosystem (indexed 2026-03-12)

Download link:

https://zenodo.org/record/5068252

Description:

The WE1S Twitter dataset contains 5,024,756 tweets posted to Twitter between December 6, 2013 and June 30, 2019. The dataset is divided into subcollections based on the query terms "humanities", "liberal arts", "stem", "science", and "science-es" (a query for the presence of either "science" or "sciences"). Subcollections can be identified in the dataset from the value of the metapath property.

The number of tweets in each subcollection is as follows:

humanities: 1,705,038
liberal-arts: 7,663
stem: 865,156
science: 2,089,985
science-es: 356,914

The tweets are distributed by year as follows:

2013: 16,335
2014: 862,746
2015: 1,711,823
2016: 947,561
2017: 976,971
2018: 324,133
2019: 185,187

Collectively, the tweets represent the work of 1,886,739 distinct usernames. Each tweet's mentions, hashtags, and links are recorded, as well as the number of likes and retweets.

Unlike most other WE1S datasets, the Twitter dataset does not contain extracted features. Instead, it contains the original text of the tweet (the value of the content property), along with a tidy_tweet property, which contains the text of the tweet after preprocessing. Tweets were preprocessed using a modified form of the WE1S preprocessing algorithm. Details can be found in the WE1S Tweet-Suite repository. (See WE1S Research Materials Overview for the relation between the project's "datasets" and "collections.")
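The counts in the description can be cross-checked with a little arithmetic. The following Python sketch is only an illustration, not part of the WE1S tooling: the names `TOTAL_TWEETS`, `SUBCOLLECTION_COUNTS`, `YEAR_COUNTS`, and `percentage_shares` are hypothetical, and the 2018 figure is read as 324,133, the value that makes the yearly counts add up to the stated total.

```python
# Figures copied from the dataset description; all names here are illustrative.
TOTAL_TWEETS = 5_024_756

SUBCOLLECTION_COUNTS = {
    "humanities": 1_705_038,
    "liberal-arts": 7_663,
    "stem": 865_156,
    "science": 2_089_985,
    "science-es": 356_914,
}

YEAR_COUNTS = {
    2013: 16_335,
    2014: 862_746,
    2015: 1_711_823,
    2016: 947_561,
    2017: 976_971,
    2018: 324_133,  # assumption: the 2018 figure read as 324,133
    2019: 185_187,
}


def percentage_shares(counts, total):
    """Return each entry's share of the total, as a percentage."""
    return {key: 100 * value / total for key, value in counts.items()}


if __name__ == "__main__":
    # Both breakdowns should account for every tweet in the dataset.
    print(sum(SUBCOLLECTION_COUNTS.values()) == TOTAL_TWEETS)
    print(sum(YEAR_COUNTS.values()) == TOTAL_TWEETS)
    shares = percentage_shares(SUBCOLLECTION_COUNTS, TOTAL_TWEETS)
    for name, share in shares.items():
        print(f"{name}: {share:.2f}%")
```

Both breakdowns sum to 5,024,756, which suggests the subcollections do not overlap in the published counts and that every tweet is assigned to exactly one year.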

Created:

2021-07-04