RISys-Lab/RedSage-CFW

Name: RISys-Lab/RedSage-CFW
Creator: RISys-Lab
Published: 2026-02-05 11:42:32
License: odc-by

Source: Hugging Face | Updated: 2026-02-05 | Indexed: 2026-03-29

Download link:

https://hf-mirror.com/datasets/RISys-Lab/RedSage-CFW


Description:

---
dataset_info:
- config_name: chunk_1
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 12928572805
    num_examples: 2644168
  download_size: 7225481499
  dataset_size: 12928572805
- config_name: chunk_2
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 12442322111
    num_examples: 2644168
  download_size: 6946939349
  dataset_size: 12442322111
- config_name: chunk_3
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 12298903026
    num_examples: 2644168
  download_size: 6838364994
  dataset_size: 12298903026
- config_name: chunk_4
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 12366516316
    num_examples: 2644168
  download_size: 6878740053
  dataset_size: 12366516316
- config_name: chunk_5
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 11968043495
    num_examples: 2644168
  download_size: 6734425417
  dataset_size: 11968043495
configs:
- config_name: chunk_1
  data_files:
  - split: train
    path: chunk_1/train-*
- config_name: chunk_2
  data_files:
  - split: train
    path: chunk_2/train-*
- config_name: chunk_3
  data_files:
  - split: train
    path: chunk_3/train-*
- config_name: chunk_4
  data_files:
  - split: train
    path: chunk_4/train-*
- config_name: chunk_5
  data_files:
  - split: train
    path: chunk_5/train-*
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- cybersecurity
- pretraining
pretty_name: RedSage-CFW
size_categories:
- 10M<n<100M
---

# Dataset Card for RedSage-CFW

"RedSage: A Cybersecurity Generalist LLM" (ICLR 2026).

Authors: Naufal Suryanto<sup>1</sup>, Muzammal Naseer<sup>1†</sup>, Pengfei Li<sup>1</sup>, Syed Talal Wasim<sup>2</sup>, Jinhui Yi<sup>2</sup>, Juergen Gall<sup>2</sup>, Paolo Ceravolo<sup>3</sup>, Ernesto Damiani<sup>3</sup>

<sup>1</sup>Khalifa University, <sup>2</sup>University of Bonn, <sup>3</sup>University of Milan

<sup>†</sup>Project Lead

<a href="https://openreview.net/forum?id=W4FAenIrQ2"><img src="https://img.shields.io/badge/Paper-OpenReview-B31B1B.svg"></a> <a href="https://huggingface.co/RISys-Lab"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-RISys--Lab-orange"></a>

🌐 <a href="https://risys-lab.github.io/RedSage/">Project Page</a> | 🤖 <a href="https://huggingface.co/collections/RISys-Lab/redsage-models">Model Collection</a> | 📊 <a href="https://huggingface.co/collections/RISys-Lab/redsage-benchmarks">Benchmark Collection</a> | 📘 <a href="https://huggingface.co/collections/RISys-Lab/redsage-datasets">Data Collection</a>

## Dataset Description

* **Developed by:** RISysLab
* **Repository:** [GitHub](https://github.com/RISys-Lab/RedSage)
* **Paper:** [RedSage: A Cybersecurity Generalist LLM](https://openreview.net/forum?id=W4FAenIrQ2)
* **arXiv:** https://arxiv.org/abs/2601.22159

### Dataset Summary

**RedSage-CFW** (CyberFineWeb) is a large-scale cybersecurity dataset designed for the continual pretraining of Large Language Models (LLMs). It consists of approximately **11.7 billion tokens** spanning **13 million documents**. The dataset was constructed by filtering the **FineWeb** corpus (Common Crawl 2013–2024) using a custom ModernBERT-based classifier to identify cybersecurity-relevant content. To prevent catastrophic forgetting of general capabilities during pretraining, the cybersecurity data is mixed with general educational content from **FineWeb-Edu**.

### Supported Tasks

* **Continual Pretraining:** Designed to adapt general-purpose LLMs (e.g., Qwen, Llama) to the cybersecurity domain.
* **Domain Adaptation:** Enhances model performance on cybersecurity knowledge, skills, and tool usage.

### Languages

The dataset primarily consists of English text, derived from Common Crawl sources.

## Dataset Structure

### Data Instances

The dataset is partitioned into 5 chunks (config names: `chunk_1` through `chunk_5`). Each instance represents a single document (e.g., a web page, article, or forum post).

### Data Fields

Based on the provided configuration, the data fields are:

* **`text`** (string): The full text content of the document.
* **`id`** (string): A unique identifier for the document.
* **`metadata`** (struct): Contains detailed attributes about the source and filtering:
  * `probability` (float64): The confidence score from the cybersecurity classifier.
  * `relevant` (bool): A flag indicating if the document passed the relevance filter.
  * `url` (string): The source URL of the document.
  * `date` (timestamp[ms]): The crawl or publication date.
  * `dump` (string): The Common Crawl dump identifier (e.g., `CC-MAIN-2024-51`).
  * `file_path` (string): Path information for the original file.
  * `language` (string): The detected language of the text.
  * `language_score` (float64): Confidence score of the language detection.
  * `token_count` (int64): The number of tokens in the document.
  * `score` (float64), `int_score` (int64): Additional quality or relevance metrics.

### Data Splits

The dataset is segmented into 5 chunks. Each chunk is a separate config with a single `train` split of 2,644,168 examples (13,220,840 in total). The paper notes that the final corpus consists of the "latest 5 chunks" from the filtered pipeline to fit training budgets.

* **Total Size:** ~11.7B tokens.
* **Total Documents:** ~13M documents.

## Dataset Creation

### Curation Rationale

Existing cybersecurity solutions often rely on proprietary APIs or lack domain adaptation. RedSage-CFW bridges this gap by providing a transparent, open-source corpus for training local, privacy-preserving cybersecurity assistants.

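As a rough illustration of the chunk configs and `metadata` fields described under Dataset Structure, the sketch below shows one way the data might be streamed and filtered. It assumes the Hugging Face `datasets` library and its `load_dataset` function; the helper names `stream_chunk` and `keep_confident` and the 0.9 probability threshold are hypothetical choices for this example, not part of the dataset or the paper. It also assumes `probability` and `relevant` are nested under `metadata`, as the field list suggests.

```python
from typing import Any, Dict, Iterable, Iterator

REPO_ID = "RISys-Lab/RedSage-CFW"
CHUNK_CONFIGS = [f"chunk_{i}" for i in range(1, 6)]  # chunk_1 .. chunk_5

# Per-chunk train split sizes, copied from the card's dataset_info block.
NUM_EXAMPLES = {name: 2_644_168 for name in CHUNK_CONFIGS}


def stream_chunk(config_name: str):
    """Stream one chunk's train split (needs `datasets` and network access)."""
    # Imported lazily so the offline helpers below work without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=True)


def keep_confident(
    records: Iterable[Dict[str, Any]], min_probability: float = 0.9
) -> Iterator[Dict[str, Any]]:
    """Yield records flagged relevant with classifier probability >= threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    for record in records:
        meta = record["metadata"]
        if meta["relevant"] and meta["probability"] >= min_probability:
            yield record


# Synthetic records shaped like the documented schema (not real data).
sample = [
    {"text": "...", "id": "doc-a", "metadata": {"probability": 0.97, "relevant": True}},
    {"text": "...", "id": "doc-b", "metadata": {"probability": 0.55, "relevant": True}},
    {"text": "...", "id": "doc-c", "metadata": {"probability": 0.10, "relevant": False}},
]

print([r["id"] for r in keep_confident(sample)])  # ['doc-a']
print(sum(NUM_EXAMPLES.values()))                 # 13220840
```

With network access, `keep_confident(stream_chunk("chunk_1"))` would apply the same filter lazily to a real chunk without downloading it in full.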
### Source Data

* **FineWeb:** The base corpus is FineWeb, aggregated from 104 Common Crawl subsets between Summer 2013 and December 2024 (~17.2T tokens).
* **FineWeb-Edu:** Used for mixing general knowledge to maintain reasoning capabilities.

### Data Processing & Filtering

1. **Classifier Training:** A binary classifier based on **ModernBERT-base** was trained on the "Cybersecurity Topic Classification" dataset (sourced from Reddit, StackExchange, and arXiv). It achieved 97.3% accuracy on the validation set.
2. **Filtering:** This classifier was applied to FineWeb, identifying ~125M cybersecurity-relevant documents (~89.8B tokens).
3. **General Knowledge Replay:** To avoid catastrophic forgetting, the cybersecurity data was mixed with FineWeb-Edu samples at a **30% replay ratio**.
4. **Deduplication:** Global deduplication was performed using MinHash-LSH (via DataTrove), reducing the corpus size by ~47.9% in tokens.
5. **Chunking:** The final dataset comprises the latest 5 chronological chunks from the processed data to manage computational costs.

## Considerations for Using the Data

### Social Impact

The dataset enables the development of open-source cybersecurity assistants, potentially helping to bridge the global skills shortage in the field.

### Discussion of Biases and Limitations

* **Source Bias:** As a web-crawled dataset, it may inherit biases present in Common Crawl and online cybersecurity discussions.
* **Dual Use:** The dataset may contain offensive security knowledge (e.g., penetration testing techniques). While intended for defense, there is an inherent risk of misuse.

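To make the MinHash-LSH deduplication step listed under Data Processing & Filtering more concrete, here is a minimal pure-Python sketch of the general technique. It is not DataTrove's implementation and does not reproduce the authors' settings: the shingle size, the 64 hash functions, the 16 bands, and the function names are all assumptions made for this example. It also drops a document on any shared band, without the verification pass a production pipeline would typically add.

```python
import hashlib
from typing import Dict, List, Set, Tuple

NUM_HASHES = 64                 # signature length (assumed, not from the paper)
BANDS = 16                      # number of LSH bands (assumed)
ROWS = NUM_HASHES // BANDS      # signature rows per band


def shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of lowercase word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, shingle: str) -> int:
    """Seeded 64-bit hash; each seed acts as one MinHash permutation."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> Tuple[int, ...]:
    """MinHash signature: the minimum hash of any shingle, per seed."""
    doc_shingles = shingles(text)
    return tuple(
        min(_hash(seed, s) for s in doc_shingles) for seed in range(NUM_HASHES)
    )


def dedup(docs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first document of each group that shares an LSH band bucket."""
    seen: Set[Tuple[int, Tuple[int, ...]]] = set()
    kept = []
    for doc in docs:
        sig = minhash(doc["text"])
        keys = [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]
        if any(key in seen for key in keys):
            continue  # collides with an earlier document in some band
        seen.update(keys)
        kept.append(doc)
    return kept


docs = [
    {"id": "a", "text": "SQL injection lets attackers alter database queries through unsanitized input."},
    {"id": "b", "text": "sql  injection lets attackers ALTER database queries through unsanitized input."},
    {"id": "c", "text": "Regular backups and tested restore procedures limit the damage of ransomware."},
]
print([d["id"] for d in dedup(docs)])  # ['a', 'c']
```

Documents `a` and `b` differ only in case and spacing, so they share every shingle and every band; `c` shares none and is kept. More bands with fewer rows each would flag looser matches as duplicates, while fewer bands with more rows would require closer matches.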
---

## Citation

```bibtex
@inproceedings{suryanto2026redsage,
  title={RedSage: A Cybersecurity Generalist {LLM}},
  author={Naufal Suryanto and Muzammal Naseer and Pengfei Li and Syed Talal Wasim and Jinhui Yi and Juergen Gall and Paolo Ceravolo and Ernesto Damiani},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=W4FAenIrQ2}
}
```
