Indonesian Tweets Dataset for Identifying Emotion Changes Among Twitter Users Following the Onset of the COVID-19

Mendeley Data · Updated 2024-03-27 · Indexed 2024-06-26

Download link:

https://data.mendeley.com/datasets/x8t4gn6mt6

Description:

Tweet data were collected through the Twitter API from a point location with a 10 km radius, to capture high tweet intensity at a strategic location. The collection point was the Setiabudi area, Jakarta, Indonesia, with latitude rounded to -6.22 and longitude to 106.83. This point was chosen via probability sampling to enhance the analysis process, as the area was the most affected by COVID-19 cases in Indonesia during the pandemic. Data collection was divided into two periods: before the COVID-19 outbreak (December 2019 to March 2020) and the beginning of the outbreak (March 2020 to June 2020). The study takes March 14, 2020 as the first day of the COVID-19 pandemic in Indonesia, according to the rules released by the Indonesian government.

Because no context-specific filtering was applied during collection, this mechanism produced large and varied data. Three data reduction steps were then performed to obtain appropriate data dimensions: a) select active users based on tweet intensity, b) remove tweets with a word count below five, and c) eliminate data based on the suitability of discussion topics. The data used in the modeling process has passed all three stages.

In line with our work, there are three labeling processes: discussion topic, emotion, and sentiment. Discussion topics were labeled through topic modeling with the LDA (Latent Dirichlet Allocation) algorithm. For emotion and sentiment, three annotators manually labeled sample data, and a majority vote determined the final class label. For emotion labeling, each annotator assigned each tweet one of "Happiness", "Love", "Fear", "Sadness", or "Anger". For sentiment labeling, each tweet was annotated with one of three predetermined categories: "Positive", "Negative", or "Neutral".
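The collection radius, the period split, the word-count filter, and the majority vote described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: all function and constant names are hypothetical, the radius check is done after the fact with a haversine distance rather than by the Twitter API, and returning `None` when no label has a strict majority is an assumption, since the description does not say how three-way disagreements were resolved.

```python
from collections import Counter
from datetime import date
from math import asin, cos, radians, sin, sqrt

# Values taken from the dataset description; the names are hypothetical.
CENTER = (-6.22, 106.83)            # rounded Setiabudi latitude and longitude
RADIUS_KM = 10.0                    # collection radius
OUTBREAK_START = date(2020, 3, 14)  # first pandemic day used by the study
MIN_WORDS = 5                       # tweets with fewer words are removed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius


def in_collection_area(lat, lon):
    """True if the point lies within RADIUS_KM of the collection point."""
    return haversine_km(CENTER[0], CENTER[1], lat, lon) <= RADIUS_KM


def period(tweet_date):
    """Assign a tweet to the 'before' or 'onset' collection period."""
    return "before" if tweet_date < OUTBREAK_START else "onset"


def has_enough_words(text):
    """Reduction step b): keep tweets with at least MIN_WORDS words.

    Words are split on whitespace, which is an assumption about tokenization.
    """
    return len(text.split()) >= MIN_WORDS


def majority_vote(labels):
    """Return the label chosen by more than half of the annotators.

    Returns None when no label has a strict majority (assumed tie handling).
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None
```

For example, `majority_vote(["Fear", "Fear", "Sadness"])` yields `"Fear"`, while three different emotion labels yield `None` and would need a separate resolution rule. Reduction steps a) and c) are omitted because the description gives no activity threshold or topic criteria.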

Created:

2024-01-23