A Large Dataset of Tweets on the 2023 Presidential Elections in Nigeria for Natural Language Processing Tasks
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/8347220
Description:
The dataset contains tweets related to the 2023 presidential elections in Nigeria. The data was retrieved from the social media network Twitter (now X) between February 4th, 2023 and April 4th, 2023. Election-related tweets were retrieved through a Twitter API using the hashtags from the official handles, together with other popular hashtags endorsed by and/or representing the candidates of each party. Three major political parties in Nigeria were considered; they are labelled Party A, Party L and Party P in this dataset. The group called "General" contains tweets retrieved with Independent National Electoral Commission (INEC) tags such as @inecnigeria and #2023election, which are not directly associated with any political party.
The dataset has been lightly pre-processed to make it useful to researchers for a wide range of natural language processing tasks, such as sentiment analysis, topic modelling, fake news detection, emotion detection, and election stance detection.
Details of the dataset collection, such as the hashtags used, the tweets retrieved, the duplicates removed, and the remaining unique tweets, are presented in Table 1.
Table 1: Tweets collection and duplicates removal
| S/N | Party | Hashtags | Retrieved tweets | Duplicate tweets | Unique tweets |
|---|---|---|---|---|---|
| 1 | X | @inecnigeria, #2023election | 64,496 | 47,275 | 17,195 |
| 2 | A | #TinubuIsComing, #emilokan, #jagabanarmy, #RenewedHope, #BATKSM2023 | 263,870 | 231,036 | 32,832 |
| 3 | L | #VoteLP, #NigeriaMustBeBright, #PeterObiForPresident2023, #ObiDatti2023, #PeterObi | 664,083 | 310,857 | 353,226 |
| 4 | P | #NigeriaDecides, #VotePDP, #AtikuOkowa2023, #FinalPushToVictory, #RecoverNigeria | 387,450 | 318,425 | 66,227 |
| | Total | | 1,379,899 | 907,593 | 468,480 |
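The bookkeeping behind Table 1 (retrieved, duplicate, and unique counts) can be sketched as follows. This is only an illustration, not the authors' procedure: the description does not say how duplicates were identified, so the sketch assumes two tweets are duplicates when their text matches after lowercasing and collapsing whitespace. The names `normalize` and `dedupe_tweets` and the sample tweets are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumed duplicate key, not from the source)."""
    return " ".join(text.lower().split())


def dedupe_tweets(tweets):
    """Return (unique_tweets, counts), keeping the first occurrence of each tweet."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    counts = {
        "retrieved": len(tweets),
        "duplicates": len(tweets) - len(unique),
        "unique": len(unique),
    }
    return unique, counts


# Hypothetical sample tweets, not taken from the dataset.
sample = [
    "Get your PVC ready #2023election",
    "get your PVC ready   #2023election",
    "Polling units open at 8:30am @inecnigeria",
]
unique, counts = dedupe_tweets(sample)
print(counts)
```

Under this assumed definition the three counts always satisfy retrieved = duplicates + unique; the procedure actually used for the dataset may differ.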
To encourage NLP tasks, we have uploaded the following files in this Version One:
The combined dataset of pre-processed tweets and their metadata, with duplicates removed, is in the file labelled “Combined Dataset Pre-processed without duplicates.csv”.
General statistics on each corpus are in the file labelled “Dataset Statistics.xlsx”.
The preprocessed corpus from the General group, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_GENERAL.xlsx”.
The preprocessed corpus from Party A, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party A.xlsx”.
The preprocessed corpus from Party L, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party L.xlsx”.
The preprocessed corpus from Party P, with the tweet contents only, is in the file labelled “Preprocessed_Tweet only_Party P.xlsx”.
The top 100 most frequent tokens and their weights are in the file labelled “Top 100 Tokens and weights.xlsx”.
The most frequent bigrams and their weights are in the file labelled “Top 100 Bigrams and weights.xlsx”.
The most frequent trigrams and their weights are in the file labelled “Top 100 Trigrams and weights.xlsx”.
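A frequency list like the token, bigram, and trigram files above could be derived from a tweet-only corpus along these lines. This is a rough sketch under stated assumptions: the description does not define "weights", so raw occurrence counts are used here, and tokens are split on whitespace after lowercasing. The function `top_ngrams` and the tiny `corpus` are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def top_ngrams(texts, n=1, k=100):
    """Count whitespace-delimited n-grams across texts and return the k most common."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)


# Hypothetical corpus standing in for one of the tweet-only files.
corpus = [
    "vote wisely on election day",
    "election day is near",
    "vote early on election day",
]
top_tokens = top_ngrams(corpus, n=1, k=3)
top_bigrams = top_ngrams(corpus, n=2, k=1)
print(top_tokens)
print(top_bigrams)
```

Setting `n` to 1, 2, or 3 gives tokens, bigrams, or trigrams; a real pipeline would likely use a tweet-aware tokenizer and stop-word handling, which this sketch leaves out.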
Created:
2023-09-17



