Rule-Based SQL Injection (RbSQLi) Dataset
Source: DataCite Commons. Updated: 2026-04-20. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/xz4d5zj5yw
Description:
The RbSQLi dataset has been developed to support research and development in the detection of SQL injection (SQLi) vulnerabilities. It contains a total of 10,190,450 structured entries, of which 2,699,570 are labeled as malicious and 7,490,880 as benign. The malicious entries are categorized into six distinct types of SQL injection attacks: Union-based (398,070 samples), Stack queries-based (223,800 samples), Time-based (564,900 samples), Meta-based (481,280 samples), Boolean-based (207,900 samples), and Error-based (823,620 samples).
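As a quick consistency check on the figures above, the six attack-type counts sum to the stated malicious total, and benign entries make up roughly 73.5% of the dataset. The short sketch below reproduces that arithmetic; the label strings and the `class_shares` helper are illustrative names, not identifiers taken from the dataset files.

```python
# Counts copied from the dataset description; label strings are illustrative.
TOTAL = 10_190_450
BENIGN = 7_490_880
MALICIOUS_BY_TYPE = {
    "Union-based": 398_070,
    "Stack queries-based": 223_800,
    "Time-based": 564_900,
    "Meta-based": 481_280,
    "Boolean-based": 207_900,
    "Error-based": 823_620,
}

# The per-type counts should add up to the stated malicious total,
# and malicious + benign should add up to the stated grand total.
malicious = sum(MALICIOUS_BY_TYPE.values())
assert malicious == 2_699_570
assert malicious + BENIGN == TOTAL


def class_shares():
    """Return each class's share of the full dataset, benign included."""
    counts = {"Benign": BENIGN, **MALICIOUS_BY_TYPE}
    return {name: n / TOTAL for name, n in counts.items()}


if __name__ == "__main__":
    for name, share in class_shares().items():
        print(f"{name:<20} {share:6.2%}")
```

The imbalance matters in practice: a multi-class model trained on the raw distribution sees far more benign rows than any single attack type, so class weighting or stratified sampling may be worth considering.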
The malicious payloads for the Union-based, Time-based, and Error-based injection types were sourced directly from the widely used open-source GitHub repository "Payloads All The Things – SQL Injection Payload List" (https://github.com/payloadbox/sql-injection-payload-list). ChatGPT was used to generate additional payloads for the Boolean-based, Stack queries-based, and Meta-based injection categories. This hybrid approach means the dataset reflects both known attack patterns and synthetically generated variants, giving broader coverage of SQLi techniques. In addition, some queries in the dataset are syntactically invalid yet still contain malicious payloads, so that models can learn to detect SQL injection attempts even when attackers submit malformed queries. This highlights the importance of training models to recognize semantic intent rather than relying solely on syntactic correctness.
All payloads were carefully curated, anonymized, and structured during preprocessing. Sensitive data was replaced with secure placeholders, preserving semantic meaning while protecting data integrity and privacy. The dataset also underwent a thorough sanitization process to ensure consistency and usability. To support scalability and reproducibility, a rule-based classification algorithm was used to automate the labeling and organization of each payload by type. This methodology promotes standardization and ensures that the dataset is ready for use in machine learning pipelines, anomaly detection models, and intrusion detection systems. In addition to being comprehensive, the dataset provides a substantial volume of clean (benign) data, making it well-suited for supervised learning, comparative experiments, and robustness testing in cybersecurity research.
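The description does not publish the actual labeling rules, so the following is only a minimal sketch of what a rule-based payload classifier could look like, assuming ordered keyword and regex rules where the first match wins. The `RULES` table, its patterns, and the `label_payload` function are all hypothetical and are not the dataset authors' algorithm.

```python
import re

# Hypothetical rules, checked in order; the first matching rule wins.
# These patterns are illustrative assumptions, not the dataset's real rules.
RULES = [
    ("Time-based", re.compile(
        r"\b(?:sleep|benchmark|pg_sleep)\s*\(|\bwaitfor\s+delay\b", re.I)),
    ("Union-based", re.compile(
        r"\bunion\b(?:\s+all)?\s+select\b", re.I)),
    ("Error-based", re.compile(
        r"\b(?:extractvalue|updatexml)\s*\(|\bconvert\s*\(\s*int\b", re.I)),
    ("Stack queries-based", re.compile(
        r";\s*(?:select|insert|update|delete|drop|create|exec)\b", re.I)),
    ("Meta-based", re.compile(
        r"\binformation_schema\b|@@version\b|\bsysobjects\b", re.I)),
    # Tautologies such as OR 1=1 or OR 'a'='a (same token on both sides).
    ("Boolean-based", re.compile(
        r"\b(?:or|and)\b\s+'?(\w+)'?\s*=\s*'?\1\b", re.I)),
]


def label_payload(query: str) -> str:
    """Return the first matching attack type, or "Benign" if no rule fires."""
    for label, pattern in RULES:
        if pattern.search(query):
            return label
    return "Benign"


if __name__ == "__main__":
    samples = [
        "SELECT name FROM users WHERE id = 5",
        "1' UNION SELECT username, password FROM users--",
        "1 AND SLEEP(5)",
        "1; DROP TABLE users",
        "' OR '1'='1",
    ]
    for sample in samples:
        print(f"{label_payload(sample):<20} {sample}")
```

Because the rules work on substrings rather than on a parsed statement, this kind of labeler also handles syntactically invalid queries. The rule order is a design choice: a payload that matches several rules receives only the first label.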
This dataset is intended to advance accurate and generalizable SQL injection detection and to serve as a benchmark for security and ML research. It supports multi-class classification, robustness testing against obfuscated inputs, sequence modeling (e.g., transformers), and semi-supervised or self-supervised learning. It can also be used to develop real-time IDS/WAF systems, to enable explainable AI, and to study data augmentation, transfer learning, and cross-dataset generalization.
Provider:
Mendeley Data
Created:
2025-05-23



