StegPDF-21: A Feature-Engineered Dataset for PDF Steganography Detection

Name: StegPDF-21: A Feature-Engineered Dataset for PDF Steganography Detection
Creator: Mendeley Data
Published: 2026-03-30 18:52:44
License: No description available

DataCite Commons. Updated 2026-03-30; indexed 2026-05-04.

Download link:

https://data.mendeley.com/datasets/pfvvrydk96/1

Description:

This dataset, referred to as StegPDF-21, is developed to support research in PDF steganography detection and document-level steganalysis using machine learning techniques. It consists of feature-based representations of PDF documents organized into two classes: clean documents (label = 0), which do not contain hidden information, and steganographic documents (label = 1), in which hidden data has been intentionally embedded.

The dataset was constructed using PDF files obtained from a publicly available stress-testing repository. Initially, approximately 32,000 PDF documents were collected. After a validation and cleaning process, in which corrupted, encrypted, malformed, and duplicate files were removed, around 20,000 valid documents were retained. From these, 10,000 clean PDFs were selected as the base set for generating steganographic samples. Steganographic documents were generated by applying eight different embedding techniques, each with three payload variants, in order to simulate a range of hiding strategies. A controlled number of samples from each steganographic variant group were selected to maintain class balance and dataset consistency.

Each document was then processed using Python-based tools to extract structural and content-related features, including properties of PDF objects, metadata characteristics, and text-based patterns. Initially, 25 features were extracted; following correlation analysis, 4 redundant features were removed, resulting in a final set of 21 numerical features. The dataset is provided in CSV format, where each row corresponds to a document and each column represents a feature, along with a binary label column where 0 denotes clean documents and 1 denotes steganographic documents. The final dataset contains 19,372 instances, with a nearly balanced distribution between the two classes. All processing steps were implemented using automated Python scripts to ensure reproducibility and consistent dataset construction.
This dataset can be used for machine learning experiments, cybersecurity research, and the evaluation of PDF steganography detection methods.
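
As a starting point for such experiments, the sketch below counts feature columns and class labels in a CSV laid out as described (one row per document, numerical feature columns, one binary label column). The column name `label` is an assumption, since the description does not name it, and the in-memory sample is a stand-in rather than real StegPDF-21 data.

```python
import csv
import io
from collections import Counter


def class_balance(csv_file, label_column="label"):
    """Return (number of feature columns, Counter of label values).

    `csv_file` is any open text file or file-like object with a header row.
    `label_column` is an assumed name for the binary label column.
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        counts[int(row[label_column])] += 1
    # Every column except the label is treated as a feature.
    n_features = len(reader.fieldnames) - 1
    return n_features, counts


# Tiny stand-in with two features; the real file should have 21.
sample = io.StringIO("f1,f2,label\n0.1,3,0\n0.4,7,1\n0.2,5,0\n")
n_features, counts = class_balance(sample)
print(n_features, dict(counts))
```

With the downloaded file, the same function can be called on `open(path, newline="")`, where `path` is wherever the CSV was saved. Based on the description, the result should be 21 feature columns and 19,372 rows split roughly evenly between labels 0 and 1.
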

Provider:

Mendeley Data

Created:

2026-03-30
