Automatically Annotated Twitter COVID-19 Dataset
Source: arXiv | Updated: 2021-07-27 | Indexed: 2024-06-21
Download link:
https://github.com/thepanacealab/annotated_twitter_covid19_dataset
Description:
The "Automatically Annotated Twitter COVID-19 Dataset" was created by the Department of Computer Science at Georgia State University and contains 226,582,903 automatically annotated English tweets. The dataset was built in two stages: tweets related to COVID-19 were first filtered out, and biomedical entities in those tweets were then recognized and annotated automatically using several NLP tools, including MedSpaCy and MedaCy. The dataset is intended mainly for biomedical research that analyzes social media data to understand the public health impact of COVID-19 and the symptoms reported by patients.
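The two-stage workflow described above (filter COVID-19 tweets, then annotate biomedical entities) can be illustrated with a minimal sketch. This is not the dataset authors' pipeline: the keyword list, the term lexicon, and the function names are all hypothetical, and a plain dictionary lookup stands in for tools such as MedSpaCy and MedaCy.

```python
import re

# Hypothetical keyword list and term lexicon, for illustration only.
# The real dataset was annotated with tools such as MedSpaCy and MedaCy.
COVID_KEYWORDS = ("covid", "coronavirus", "sars-cov-2")
LEXICON = {"fever": "SYMPTOM", "cough": "SYMPTOM", "remdesivir": "DRUG"}


def is_covid_related(text):
    """Stage 1: keep a tweet if it mentions any COVID-19 keyword."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in COVID_KEYWORDS)


def annotate(text):
    """Stage 2: return (term, label, start, end) for each lexicon term found."""
    spans = []
    for term, label in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.group(0).lower(), label, match.start(), match.end()))
    # Order the annotations by their position in the tweet.
    return sorted(spans, key=lambda span: span[2])


def build_annotations(tweets):
    """Run both stages and pair each retained tweet with its annotations."""
    return [(tweet, annotate(tweet)) for tweet in tweets if is_covid_related(tweet)]
```

Calling `build_annotations` on a list of tweet strings returns only the COVID-19-related tweets, each paired with the character offsets of the terms that were matched. A real pipeline would replace `annotate` with a trained biomedical NER model.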
Provided by:
Department of Computer Science, Georgia State University
Created:
2021-07-27



