RISys-Lab/RedSage-CFW

Source: Hugging Face | Updated: 2026-02-05 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/RISys-Lab/RedSage-CFW
Resource description:
---
dataset_info:
- config_name: chunk_1
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 12928572805
    num_examples: 2644168
  download_size: 7225481499
  dataset_size: 12928572805
- config_name: chunk_2
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 12442322111
    num_examples: 2644168
  download_size: 6946939349
  dataset_size: 12442322111
- config_name: chunk_3
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 12298903026
    num_examples: 2644168
  download_size: 6838364994
  dataset_size: 12298903026
- config_name: chunk_4
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 12366516316
    num_examples: 2644168
  download_size: 6878740053
  dataset_size: 12366516316
- config_name: chunk_5
  features:
  - name: text
    dtype: string
  - name: id
    dtype: string
  - name: metadata
    struct:
    - name: probability
      dtype: float64
    - name: relevant
      dtype: bool
    - name: dump
      dtype: string
    - name: url
      dtype: string
    - name: date
      dtype: timestamp[ms]
    - name: file_path
      dtype: string
    - name: language
      dtype: string
    - name: language_score
      dtype: float64
    - name: token_count
      dtype: int64
    - name: idx
      dtype: int64
    - name: score
      dtype: float64
    - name: int_score
      dtype: int64
  splits:
  - name: train
    num_bytes: 11968043495
    num_examples: 2644168
  download_size: 6734425417
  dataset_size: 11968043495
configs:
- config_name: chunk_1
  data_files:
  - split: train
    path: chunk_1/train-*
- config_name: chunk_2
  data_files:
  - split: train
    path: chunk_2/train-*
- config_name: chunk_3
  data_files:
  - split: train
    path: chunk_3/train-*
- config_name: chunk_4
  data_files:
  - split: train
    path: chunk_4/train-*
- config_name: chunk_5
  data_files:
  - split: train
    path: chunk_5/train-*
license: odc-by
task_categories:
- text-generation
language:
- en
tags:
- cybersecurity
- pretraining
pretty_name: RedSage-CFW
size_categories:
- 10M<n<100M
---

# Dataset Card for RedSage-CFW

<p align="center">
  <b>"RedSage: A Cybersecurity Generalist LLM" (ICLR 2026).</b>
  <br>
  <b>Authors:</b>
  Naufal Suryanto<sup>1</sup>, Muzammal Naseer<sup>1†</sup>, Pengfei Li<sup>1</sup>, Syed Talal Wasim<sup>2</sup>, Jinhui Yi<sup>2</sup>, Juergen Gall<sup>2</sup>, Paolo Ceravolo<sup>3</sup>, Ernesto Damiani<sup>3</sup>
  <br>
  <sup>1</sup>Khalifa University, <sup>2</sup>University of Bonn, <sup>3</sup>University of Milan
  <br>
  <sup>†</sup>Project Lead
  <br>
  <br>
  <a href="https://openreview.net/forum?id=W4FAenIrQ2"><img src="https://img.shields.io/badge/Paper-OpenReview-B31B1B.svg"></a>
  <a href="https://huggingface.co/RISys-Lab"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-RISys--Lab-orange"></a>
  <br>
  🌐 <a href="https://risys-lab.github.io/RedSage/">Project Page</a>&nbsp;&nbsp;|&nbsp;&nbsp;
  🤖 <a href="https://huggingface.co/collections/RISys-Lab/redsage-models">Model Collection</a>&nbsp;&nbsp;|&nbsp;&nbsp;
  📊 <a href="https://huggingface.co/collections/RISys-Lab/redsage-benchmarks">Benchmark Collection</a>&nbsp;&nbsp;|&nbsp;&nbsp;
  📘 <a href="https://huggingface.co/collections/RISys-Lab/redsage-datasets">Data Collection</a>
</p>

****

## Dataset Description

* **Developed by:** RISysLab
* **Repository:** [GitHub](https://github.com/RISys-Lab/RedSage)
* **Paper:** [RedSage: A Cybersecurity Generalist LLM](https://openreview.net/forum?id=W4FAenIrQ2)
* **arXiv:** https://arxiv.org/abs/2601.22159

### Dataset Summary

**RedSage-CFW** (CyberFineWeb) is a large-scale cybersecurity dataset designed for the continual pretraining of Large Language Models (LLMs). It consists of approximately **11.7 billion tokens** spanning **13 million documents**. The dataset was constructed by filtering the **FineWeb** corpus (Common Crawl 2013–2024) using a custom ModernBERT-based classifier to identify cybersecurity-relevant content. To prevent catastrophic forgetting of general capabilities during pretraining, the cybersecurity data is mixed with general educational content from **FineWeb-Edu**.
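
The rounded totals in the summary can be cross-checked against the per-chunk numbers in the YAML front matter. The snippet below is a small sanity check, not part of the official card: the byte and example counts are copied from the front matter, while `APPROX_TOTAL_TOKENS` is the card's rounded 11.7B figure, so the derived average is only indicative.

```python
# (config_name, num_bytes, num_examples, download_size), copied from the front matter.
CHUNK_STATS = [
    ("chunk_1", 12_928_572_805, 2_644_168, 7_225_481_499),
    ("chunk_2", 12_442_322_111, 2_644_168, 6_946_939_349),
    ("chunk_3", 12_298_903_026, 2_644_168, 6_838_364_994),
    ("chunk_4", 12_366_516_316, 2_644_168, 6_878_740_053),
    ("chunk_5", 11_968_043_495, 2_644_168, 6_734_425_417),
]

# Rounded figure quoted in the Dataset Summary (an approximation, not an exact count).
APPROX_TOTAL_TOKENS = 11.7e9

total_examples = sum(examples for _, _, examples, _ in CHUNK_STATS)
total_bytes = sum(num_bytes for _, num_bytes, _, _ in CHUNK_STATS)
total_download = sum(download for _, _, _, download in CHUNK_STATS)

# Indicative only: divides a rounded token total by an exact document count.
avg_tokens_per_doc = APPROX_TOTAL_TOKENS / total_examples

print(f"documents:        {total_examples:,}")
print(f"uncompressed:     {total_bytes / 1e9:.1f} GB")
print(f"download size:    {total_download / 1e9:.1f} GB")
print(f"avg tokens / doc: {avg_tokens_per_doc:.0f} (approximate)")
```

The five chunks add up to 13,220,840 documents, which matches the "13 million documents" stated above.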

### Supported Tasks

* **Continual Pretraining:** Designed to adapt general-purpose LLMs (e.g., Qwen, Llama) to the cybersecurity domain.
* **Domain Adaptation:** Enhances model performance on cybersecurity knowledge, skills, and tool usage.

### Languages

The dataset primarily consists of English text, derived from Common Crawl sources.

## Dataset Structure

### Data Instances

The dataset is partitioned into 5 chunks (config names: `chunk_1` through `chunk_5`). Each instance represents a single document (e.g., a web page, article, or forum post).

### Data Fields

Based on the provided configuration, the data fields are:

* **`text`** (string): The full text content of the document.
* **`id`** (string): A unique identifier for the document.
* **`metadata`** (struct): Contains detailed attributes about the source and filtering:
    * `probability` (float64): The confidence score from the cybersecurity classifier.
    * `relevant` (bool): A flag indicating if the document passed the relevance filter.
    * `url` (string): The source URL of the document.
    * `date` (timestamp): The crawl or publication date.
    * `dump` (string): The Common Crawl dump identifier (e.g., `CC-MAIN-2024-51`).
    * `file_path` (string): Path information for the original file.
    * `language` (string): The detected language of the text.
    * `language_score` (float64): Confidence score of the language detection.
    * `token_count` (int64): The number of tokens in the document.
    * `idx` (int64): An integer index listed in the configuration.
    * `score`, `int_score`: Additional quality or relevance metrics.

### Data Splits

The dataset is segmented into 5 chunks, each published as its own config with a single `train` split of 2,644,168 examples. The paper notes that the final corpus consists of the "latest 5 chunks" from the filtered pipeline to fit training budgets.

* **Total Size:** ~11.7B tokens.
* **Total Documents:** ~13M documents.

## Dataset Creation

### Curation Rationale

Existing cybersecurity solutions often rely on proprietary APIs or lack domain adaptation. RedSage-CFW bridges this gap by providing a transparent, open-source corpus for training local, privacy-preserving cybersecurity assistants.

### Source Data

* **FineWeb:** The base corpus is FineWeb, aggregated from 104 Common Crawl subsets between Summer 2013 and December 2024 (~17.2T tokens).
* **FineWeb-Edu:** Used for mixing general knowledge to maintain reasoning capabilities.

### Data Processing & Filtering

1. **Classifier Training:** A binary classifier based on **ModernBERT-base** was trained on the "Cybersecurity Topic Classification" dataset (sourced from Reddit, StackExchange, and arXiv). It achieved 97.3% accuracy on the validation set.
2. **Filtering:** This classifier was applied to FineWeb, identifying ~125M cybersecurity-relevant documents (~89.8B tokens).
3. **General Knowledge Replay:** To avoid catastrophic forgetting, the cybersecurity data was mixed with FineWeb-Edu samples at a **30% replay ratio**.
4. **Deduplication:** Global deduplication was performed using MinHash-LSH (via DataTrove), reducing the corpus size by ~47.9% in tokens.
5. **Chunking:** The final dataset comprises the latest 5 chronological chunks from the processed data to manage computational costs.

## Considerations for Using the Data

### Social Impact

The dataset enables the development of open-source cybersecurity assistants, potentially helping to bridge the global skills shortage in the field.

### Discussion of Biases and Limitations

* **Source Bias:** As a web-crawled dataset, it may inherit biases present in Common Crawl and online cybersecurity discussions.
* **Dual Use:** The dataset may contain offensive security knowledge (e.g., penetration testing techniques). While the dataset is intended for defensive use, there is an inherent risk of misuse.
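
The deduplication step under Data Processing & Filtering names MinHash-LSH via DataTrove. The sketch below is not DataTrove's implementation and does not reproduce the settings used for RedSage-CFW. It is a small pure-Python illustration of the general idea, in which the word-trigram shingles, the 64 hash functions, and the 16 bands are all illustrative assumptions.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (n=3 is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def minhash_signature(items, num_perm=64):
    """MinHash signature: the minimum hash of the set under each seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return tuple(signature)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, num_perm=64, bands=16):
    """Keep the first document of each LSH collision group.

    `docs` maps a document id to its text. A document is dropped when any band
    of its signature equals the same band of a document that was already kept.
    """
    rows = num_perm // bands
    seen_bands = set()
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_perm)
        keys = {(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)}
        if seen_bands.isdisjoint(keys):
            kept.append(doc_id)
            seen_bands |= keys
    return kept
```

Splitting the signature into more bands with fewer rows makes the filter more aggressive (lower-similarity pairs collide), while fewer bands with more rows makes it stricter. Production pipelines typically also run this across shards rather than in a single in-memory loop.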

---

## Citation

```bibtex
@inproceedings{suryanto2026redsage,
  title={RedSage: A Cybersecurity Generalist {LLM}},
  author={Naufal Suryanto and Muzammal Naseer and Pengfei Li and Syed Talal Wasim and Jinhui Yi and Juergen Gall and Paolo Ceravolo and Ernesto Damiani},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=W4FAenIrQ2}
}
```
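
For readers who want to inspect the fields listed under Data Fields, the following is a minimal sketch of streaming one chunk with the Hugging Face `datasets` library and filtering on the classifier metadata. The repository id and config names come from this card. The helper names and the `min_probability` cut-off of 0.9 are hypothetical and chosen only for illustration, and the card does not state whether rows with `relevant` set to false are present in the published chunks.

```python
from itertools import islice

REPO_ID = "RISys-Lab/RedSage-CFW"
CHUNKS = [f"chunk_{i}" for i in range(1, 6)]  # config names from the card


def stream_chunk(config_name="chunk_1"):
    """Lazily stream one chunk from the Hub (needs `pip install datasets` and network access)."""
    # Imported here so the helpers below can be used without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=True)


def high_confidence(records, min_probability=0.9):
    """Yield records flagged relevant whose classifier probability meets the cut-off.

    `min_probability` is a hypothetical threshold, not a value from the paper.
    """
    for record in records:
        meta = record["metadata"]
        if meta["relevant"] and meta["probability"] >= min_probability:
            yield record


def preview(records, n=3):
    """Return (id, url, token_count) for the first `n` records."""
    return [
        (r["id"], r["metadata"]["url"], r["metadata"]["token_count"])
        for r in islice(records, n)
    ]


# Example (requires network access):
# print(preview(high_confidence(stream_chunk("chunk_5"))))
```

Streaming avoids downloading a full chunk (several gigabytes each) before looking at the first few documents.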
Provider:
RISys-Lab