OMCHOKSI108/cybersecdata
收藏Hugging Face2026-04-26 更新2026-05-03 收录
下载链接:
https://hf-mirror.com/datasets/OMCHOKSI108/cybersecdata
下载链接
链接失效反馈官方服务:
资源简介:
Pralay是一个网络安全指令调优数据集,专门用于微调网络安全助理模型。它是一个合并、清理和去重的聊天格式数据集,由OM CHOKSI为Pralay项目构建。数据集包含总计194,318个聊天样本(训练集174,886个,验证集19,432个,约90/10分割),每个样本都是一个遵循OpenAI聊天格式的{system, user, assistant}三元组。数据来源结合了6个公开的Hugging Face网络安全数据集(约204,000行)以及从37本网络安全教科书生成的约11,000页自定义问答对(约11,000个样本)。数据集经过模式验证和基于用户与助理内容哈希的精确去重处理,可直接用于SFTTrainer、unsloth、Llama-Factory等训练工具。数据集旨在用于监督微调中小型开源LLM,适用于网络安全助理用例,并推荐在推理时结合RAG层使用源PDF块。
Pralay is a cybersecurity instruction-tuning dataset, designed for fine-tuning a cybersecurity assistant. It is a merged, cleaned, and deduplicated chat-format dataset built for the Pralay project by OM CHOKSI. The dataset contains a total of 194,318 chat samples (174,886 for training and 19,432 for validation, with a ~90/10 split). Each sample is a {system, user, assistant} triple in OpenAI chat format. It combines 6 public Hugging Face cybersecurity datasets (~204K rows) with custom Q&A generated from 37 cybersecurity textbooks (~11K pages resulting in ~11K samples). The dataset is schema-validated and exact-deduplicated by user+assistant content hash, ready for direct use with tools like SFTTrainer, unsloth, and Llama-Factory. It is intended for supervised fine-tuning of small to mid-sized open LLMs for cybersecurity assistant use cases, and it is recommended to pair it with a RAG layer over the source PDFs at inference time.
提供机构:
OMCHOKSI108



