OMCHOKSI108/cybersecdata

Name: OMCHOKSI108/cybersecdata
Creator: OMCHOKSI108
Published: 2026-04-26 04:32:32
License: No description available

Hugging Face | Updated 2026-04-26 | Indexed 2026-05-03

Download link:

https://hf-mirror.com/datasets/OMCHOKSI108/cybersecdata

Description:

Pralay is a cybersecurity instruction-tuning dataset for fine-tuning cybersecurity assistant models. It is a merged, cleaned, and deduplicated chat-format dataset built by OM CHOKSI for the Pralay project. It contains 194,318 chat samples in total: 174,886 for training and 19,432 for validation, a roughly 90/10 split. Each sample is a {system, user, assistant} triple in OpenAI chat format. The data combines 6 public Hugging Face cybersecurity datasets (~204K rows) with custom Q&A pairs generated from 37 cybersecurity textbooks (~11K pages, yielding ~11K samples). The dataset is schema-validated and exact-deduplicated on a hash of the user and assistant content, so it can be used directly with training tools such as SFTTrainer, unsloth, and Llama-Factory. It is intended for supervised fine-tuning of small to mid-sized open LLMs for cybersecurity assistant use cases; pairing it with a RAG layer over the source PDF chunks at inference time is recommended.
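
The schema validation and exact deduplication described above could look roughly like the sketch below. It is an illustration only, not the author's pipeline. It assumes each record stores its triple under a "messages" key as a list of {"role", "content"} dicts, that all three roles must carry non-empty text, and that SHA-256 is the hash; the listing states none of these details. The function names are hypothetical.

```python
import hashlib

# Assumed role order for one sample; the listing only says {system, user, assistant}.
EXPECTED_ROLES = ("system", "user", "assistant")


def is_valid_sample(sample):
    """Return True if the sample is a system/user/assistant triple with non-empty text."""
    messages = sample.get("messages") if isinstance(sample, dict) else None
    if not isinstance(messages, list) or len(messages) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(messages, EXPECTED_ROLES):
        if not isinstance(message, dict) or message.get("role") != role:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True


def content_key(sample):
    """Hash the user and assistant text; the system prompt is left out of the key."""
    user = sample["messages"][1]["content"]
    assistant = sample["messages"][2]["content"]
    # A separator keeps ("ab", "c") and ("a", "bc") from producing the same key.
    payload = user + "\x1f" + assistant
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def validate_and_dedupe(samples):
    """Drop invalid samples, then keep the first occurrence of each content key."""
    seen = set()
    kept = []
    for sample in samples:
        if not is_valid_sample(sample):
            continue
        key = content_key(sample)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept
```

Because the key covers only the user and assistant text, two samples that differ solely in their system prompt count as duplicates, which matches the "user+assistant content hash" wording in the description.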

Provider:

OMCHOKSI108
