SenTopX: A Benchmark Twitter Dataset for User Sentiment on Various Topics

Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/11243661
Description:
This is a longitudinal Twitter dataset of 143K users covering the period 2017-2021. The files are as follows:

SenTopX_userIDs.txt: contains the user IDs of the 143K Twitter users.

userIDs_tweetIDs.zip: contains the tweet IDs of each user. Each file is named after a user ID and lists all tweet IDs for that user.

users_16_perspective_toxicity_scores.csv: contains user IDs and 16 Perspective API scores per user, shared as the mean, median, and Gini index of each score calculated over all tweets of a user.

LDAvis_top30_words_for_extracted_topics.csv: contains the top 30 most relevant words for each topic extracted by tweet-level topic modeling using the BERTweet topic model.

topic_modelling_statistics_per_user.csv: contains statistics on the topic modeling results, one row per user, with the following columns:

1. user: The identifier of the user. Each row in the CSV corresponds to one user.
2. avg_topic_probability: The average probability of the topics for each user, calculated across all of the user's tweets so that users can be compared in a meaningful way. It represents the average likelihood that the user discusses each topic over the observed period.
3. maximum_topic_avg: The highest average probability among all topics for each user, i.e. the average probability of the topic the user discusses most.
4. index_max_avg_topic_probability_200: The index of the topic with the highest average probability out of 200 possible topics. It shows which topic (out of 200) the user discusses the most.
5. global_avg: The global average probability of topics across all users. It provides a baseline for comparing individual users.
6. max_global_avg: The maximum global average probability across all topics for all users, i.e. the average probability of the most discussed topic across the entire user base.
7. index_max_global_avg: The index of the topic with the highest global average probability. It indicates which topic (out of 200) is the most popular across all users.
8. entropy_200_topic: The entropy of each user's topic distribution, calculated over 200 topics. Entropy measures the diversity of the user's topic discussion, with higher entropy indicating more varied topics.

In summary, these columns describe users' topic engagement and preferences: the most frequently discussed topics, the variability in topic discussion, and how individual user behavior compares to overall trends.
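The per-user statistics described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `user_topic_stats` and `gini`, the input layout (one row of topic probabilities per tweet), the renormalisation before computing entropy, and the base-2 logarithm are all assumptions made for illustration, since the description does not state them.

```python
import numpy as np


def user_topic_stats(tweet_topic_probs):
    """Summarise one user's tweets (hypothetical helper, not from the dataset).

    tweet_topic_probs: array-like of shape (n_tweets, n_topics), where each
    row is assumed to be the topic probability vector of one tweet.
    """
    probs = np.asarray(tweet_topic_probs, dtype=float)
    avg = probs.mean(axis=0)  # average probability of each topic
    p = avg / avg.sum()       # renormalise so the entropy is well defined
    nonzero = p[p > 0]        # skip zero entries to avoid log(0)
    entropy = float(-(nonzero * np.log2(nonzero)).sum())  # base 2 is an assumption
    return {
        "avg_topic_probability": avg,
        "maximum_topic_avg": float(avg.max()),
        "index_max_avg_topic_probability": int(avg.argmax()),
        "entropy": entropy,
    }


def gini(values):
    """Gini index of non-negative values (e.g. one score over a user's tweets)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * total) - (n + 1) / n)


if __name__ == "__main__":
    # Toy user with two tweets over four topics (made-up numbers).
    stats = user_topic_stats([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
    print(stats["maximum_topic_avg"], stats["entropy"])
    print(gini([0.0, 0.0, 0.0, 1.0]))
```

With 200 topics, a user who spreads attention evenly across all of them would reach the maximum entropy of log2(200) under these assumptions, while a user who only discusses one topic would have an entropy of 0.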
Created:
2024-05-27