Dataset of Arabic Spam and Ham Tweets
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://data.mendeley.com/datasets/86x733xkb8
Description:
Cite the following data descriptor paper when you use this dataset:
Kaddoura, Sanaa, and Safaa Henno. "Dataset of Arabic spam and ham tweets." Data in Brief 52 (2024): 109904.
Paper Link: https://www.sciencedirect.com/science/article/pii/S2352340923009472
The data was analyzed in this article:
Kaddoura, S., Alex, S. A., Itani, M., Henno, S., AlNashash, A., & Hemanth, D. J. (2023). Arabic spam tweets classification using deep learning. Neural Computing and Applications, 1-14.
The data were collected from Twitter using the Twitter API between January 27, 2021, and March 10, 2021. The fields downloaded for each tweet are Tweet ID, DateTime, URL, Tweet Text, User Name, Location, Replied Tweet ID, Replied Tweet User ID, Replied Tweet Username, Retweet Count, Favorite Count, and Favorited.
The dataset contains two files.
The first file, "Dataset of Arabic Spam and Ham Tweets.xlsx", contains the original collected dataset. It holds 13241 records, each representing one tweet. Every tweet is labeled either Ham (non-spam) or Spam. There are 1924 Spam tweets and 11299 Ham tweets. The tweets are unique, i.e., there are no repeated tweet records.
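The counts quoted for the original file can be sanity-checked with a little arithmetic. The following is a minimal sketch, not part of the dataset's tooling: the function name `summarize_labels` is hypothetical, and the numbers are copied from this description rather than recomputed from the spreadsheet.

```python
def summarize_labels(counts, stated_total):
    """Summarize label counts against a stated record total.

    counts: mapping of label -> number of tweets with that label.
    stated_total: the record count quoted in the description.
    """
    labeled = sum(counts.values())
    # Share of each class among the labeled tweets.
    shares = {label: n / labeled for label, n in counts.items()}
    return {
        "labeled": labeled,
        "gap": stated_total - labeled,  # records not covered by the label counts
        "shares": shares,
    }


# Figures as quoted in the description of the original file.
original = summarize_labels({"Spam": 1924, "Ham": 11299}, stated_total=13241)
print("labeled tweets:", original["labeled"])
print("gap vs. stated total:", original["gap"])
print("spam share: {:.1%}".format(original["shares"]["Spam"]))
```

With the quoted figures, the two label counts sum to 13223, which is 18 fewer than the stated 13241 records. The description does not explain the difference, so it may be worth verifying against the file itself before relying on either number.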
The second file, "Augmented_SpamHamTweets.xlsx", is the result of applying contextual augmentation to the dataset to enlarge the minority class, which is the "spam" class. It is intended to give better and more reliable results when machine learning is applied to the data. The augmented dataset contains 11030 ham tweets and 15128 spam tweets.
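To see how the augmentation changes the class balance, the quoted counts for the two files can be compared directly. This is an illustrative sketch only: `imbalance_ratio` is a hypothetical helper, and the counts are taken from this description, not read from the spreadsheets.

```python
def imbalance_ratio(counts):
    """Ratio of the largest class count to the smallest (1.0 means balanced)."""
    return max(counts.values()) / min(counts.values())


# Counts as quoted in the description of each file.
before = {"Ham": 11299, "Spam": 1924}   # original file
after = {"Ham": 11030, "Spam": 15128}   # augmented file

for name, counts in (("original", before), ("augmented", after)):
    majority = max(counts, key=counts.get)
    print(f"{name}: majority={majority}, ratio={imbalance_ratio(counts):.2f}")
```

On these figures the ratio drops from about 5.87 (Ham to Spam) in the original file to about 1.37 in the augmented file, where Spam is the larger class. Note also that the quoted Ham count is lower in the augmented file (11030) than in the original (11299).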
Disclaimer:
This research paper includes certain terms that may be deemed offensive by some individuals. These words have been included solely to represent real-world datasets and research findings accurately. We apologize for any offense caused, as our research is dedicated to advancing fair spam content detection on social media. It's important to note that none of the authors endorse the use of offensive keywords.
Created:
2024-04-02



