TweetsCOV19 - A Semantically Annotated Corpus of Tweets About the COVID-19 Pandemic (Part 2, May 2020)
Source: NIAID Data Ecosystem (indexed 2026-03-12)
Download link:
https://zenodo.org/record/4593501
Description:
TweetsCOV19 is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. Metadata information about the tweets as well as extracted entities, sentiments, hashtags, user mentions, and resolved URLs are exposed in RDF using established RDF/S vocabularies*.
We also provide a tab-separated values (TSV) version of the dataset. Each line contains the features of one tweet instance, separated by a tab character ("\t"). The following list gives the features in index order:
Tweet Id: Long.
Username: String. Encrypted for privacy reasons*.
Timestamp: Format ( "EEE MMM dd HH:mm:ss Z yyyy" ).
#Followers: Integer.
#Friends: Integer.
#Retweets: Integer.
#Favorites: Integer.
Entities: String. For each entity, we aggregated the original text, the annotated entity, and the score produced by the FEL library. Entities are separated from one another by the char ";", and the three fields within each entity are separated by the char ":", giving the form "original_text:annotated_entity:score;". If FEL did not find any entities, we have stored "null;".
Sentiment: String. SentiStrength produces a score for positive (1 to 5) and negative (-1 to -5) sentiment. We split these two numbers with a whitespace char " ". The positive sentiment is stored first, followed by the negative sentiment (e.g. "2 -1").
Mentions: String. If the tweet contains mentions, we remove the char "@" and concatenate the mentions with whitespace char " ". If no mentions appear, we have stored "null;".
Hashtags: String. If the tweet contains hashtags, we remove the char "#" and concatenate the hashtags with whitespace char " ". If no hashtags appear, we have stored "null;".
URLs: String. If the tweet contains URLs, we concatenate the URLs using ":-: ". If no URLs appear, we have stored "null;".
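The field layout above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the names `Tweet`, `parse_line`, `parse_entities` and `parse_list` are hypothetical, the sample line is invented for illustration, and the code assumes exactly 12 tab-separated columns in the order listed. It also assumes that splitting an entity from the right on ":" is safe (that is, that a stray ":" appears only in the original text, if at all), and that day and month names are in English.

```python
from dataclasses import dataclass
from datetime import datetime

NULL = "null;"  # sentinel the dataset stores for empty list-valued fields
# Python equivalent of the Java-style pattern "EEE MMM dd HH:mm:ss Z yyyy"
TS_FORMAT = "%a %b %d %H:%M:%S %z %Y"


@dataclass
class Tweet:
    tweet_id: int
    username: str
    timestamp: datetime
    followers: int
    friends: int
    retweets: int
    favorites: int
    entities: list    # (original_text, annotated_entity, score) tuples
    sentiment: tuple  # (positive, negative)
    mentions: list
    hashtags: list
    urls: list


def parse_entities(field):
    """Split 'text:entity:score;text:entity:score;' into tuples."""
    if field == NULL:
        return []
    result = []
    for chunk in field.split(";"):
        if not chunk:
            continue  # the trailing ';' leaves an empty chunk
        # Assumption: split from the right so a ':' in the original text survives.
        text, entity, score = chunk.rsplit(":", 2)
        result.append((text, entity, float(score)))
    return result


def parse_list(field, sep=None):
    """Split a joined field; sep=None splits on whitespace."""
    if field == NULL:
        return []
    return [item.strip() for item in field.split(sep) if item.strip()]


def parse_line(line):
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, got {len(cols)}")
    positive, negative = cols[8].split()
    return Tweet(
        tweet_id=int(cols[0]),
        username=cols[1],
        timestamp=datetime.strptime(cols[2], TS_FORMAT),
        followers=int(cols[3]),
        friends=int(cols[4]),
        retweets=int(cols[5]),
        favorites=int(cols[6]),
        entities=parse_entities(cols[7]),
        sentiment=(int(positive), int(negative)),
        mentions=parse_list(cols[9]),
        hashtags=parse_list(cols[10]),
        urls=parse_list(cols[11], sep=":-:"),
    )


# Invented sample line, for illustration only (not taken from the dataset).
sample = "\t".join([
    "1256", "abc123", "Sat May 02 10:15:30 +0000 2020", "10", "20", "3", "4",
    "covid:COVID-19:-1.5;", "2 -1", "null;", "covid19 lockdown",
    "https://example.org/a:-: https://example.org/b:-: ",
])
tweet = parse_line(sample)
```

Mapping the "null;" sentinel to an empty list keeps downstream code free of special cases, and rejecting lines with an unexpected column count surfaces malformed rows early instead of silently shifting fields.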
To extract the dataset from TweetsKB, we compiled a seed list of 268 COVID-19-related keywords.
* For the sake of privacy, we anonymize user IDs and we do not provide the text of the tweets.
Created:
2021-03-11



