williegeodev/dga-compliance-dataset
Source: Hugging Face. Updated 2026-04-25; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/williegeodev/dga-compliance-dataset
Description:
This dataset contains 2,000,000 domain name records, each with 20 handcrafted lexical features extracted from the second-level domain string, built for binary classification of Domain Generation Algorithm (DGA) generated domains against legitimate domains. The dataset is perfectly balanced: 1,000,000 DGA domain strings sampled from the ExtraHop golden-rules corpus and 1,000,000 benign domain strings sampled from the Tranco top-one-million popularity ranking list. The features fall into four categories: length-based properties, character composition statistics, ratio and entropy measures, and maximum consecutive character run lengths. The binary label column encodes DGA-generated domains as 1 and benign domains as 0. All associated experiments partition the dataset with an 80/20 stratified train-test split using random_state=42.
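To make the four feature categories concrete, here is a minimal sketch of how features of this kind could be computed from a second-level domain string. The description does not list the 20 features or their column names, so the function names and dictionary keys below (`lexical_features`, `digit_ratio`, `max_consecutive_consonants`, and so on) are hypothetical illustrations, not the dataset's actual schema.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution of s, in bits."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())


def max_run(s: str, predicate) -> int:
    """Length of the longest run of consecutive characters matching predicate."""
    best = current = 0
    for ch in s:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def lexical_features(sld: str) -> dict:
    """Illustrative features for one second-level domain string.

    Keys are hypothetical; they only mirror the four categories named in
    the description (length, composition, ratio/entropy, run lengths).
    """
    s = sld.lower()
    n = len(s)
    digits = sum(ch.isdigit() for ch in s)
    vowels = sum(ch in VOWELS for ch in s)
    return {
        # length-based
        "length": n,
        # character composition
        "digit_count": digits,
        "vowel_count": vowels,
        "unique_chars": len(set(s)),
        # ratio and entropy
        "digit_ratio": digits / n if n else 0.0,
        "vowel_ratio": vowels / n if n else 0.0,
        "entropy": shannon_entropy(s),
        # maximum consecutive runs
        "max_consecutive_digits": max_run(s, str.isdigit),
        "max_consecutive_consonants": max_run(
            s, lambda ch: ch.isalpha() and ch not in VOWELS
        ),
    }
```

Calling `lexical_features("google")` and `lexical_features("x7k2qq9vbz")` side by side shows the kind of contrast such features are meant to capture: algorithmically generated strings tend to have fewer vowels, more digits, and longer consonant runs than dictionary-like names.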
Provider:
williegeodev



