Set of obfuscated spam dataset by using LeetSpeak transformations

NIAID Data Ecosystem, indexed 2026-03-13

Download link:

https://zenodo.org/record/6373652

Description:

Spammers often use LeetSpeak and other text-hiding tricks when distributing unsolicited content. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets, partially obfuscating their text. The transformed datasets are:
YouTube Spam Collection [2, 3], available at https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/
A subset of YouTube Comments [4, 5], available at http://mlg.ucd.ie/yt/
CSDMC2010, available at http://csmining.org/index.php/spam-email-datasets-.html
TREC2007, available at https://plg.uwaterloo.ca/~gvcormac/treccorpus07/
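As an illustration of what partial LeetSpeak obfuscation can look like, here is a minimal Python sketch. It does not reproduce the procedure used to build this dataset: the substitution table `LEET_MAP`, the `rate` and `seed` parameters, and both function names are assumptions made for the example.

```python
import random

# Hypothetical substitution table; the mapping actually used for the
# dataset is not given in the description above.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def leet_obfuscate(text, rate=0.5, seed=None):
    """Replace each mappable character with probability `rate`."""
    rng = random.Random(seed)  # a seeded generator keeps runs reproducible
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)


def leet_deobfuscate(text):
    """Naive inverse lookup; note that it also rewrites genuine digits."""
    inverse = {digit: letter for letter, digit in LEET_MAP.items()}
    return "".join(inverse.get(ch, ch) for ch in text)
```

With `rate=1.0` every mappable character is replaced, so `leet_obfuscate("free money", rate=1.0)` returns `"fr33 m0n3y"`. Lower rates leave part of the text intact, which is the "partial" case described above. The naive inverse shows why deobfuscation is harder than obfuscation: it cannot tell a substituted digit from a real one.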

Created:

2022-03-22
