SecLLM

NIAID Data Ecosystem, indexed 2026-05-02

Download link:

https://zenodo.org/record/14254294


Description:

Generative Artificial Intelligence, in particular Large Language Models (LLMs) and LLM-based agents, has significantly changed the way humans perform daily activities. Features such as holding smart conversations with and without context, and answering questions of any kind and from any domain, are attracting growing attention from researchers and practitioners, because we need to understand and assess the weaknesses and strengths of these models. For instance, hallucination is a well-known issue in LLMs, as is the possibility of providing inappropriate answers when the models lack filters or are biased or poisoned. Previous work has assessed LLMs under different contexts and scenarios, e.g., code generation. However, few studies have been done in the context of information security; to our knowledge, no previous work has analyzed the quality of answers provided by LLMs to cybersecurity-related questions. Therefore, we present a dataset of questions extracted from StackExchange, including their top-10 answers and the answers generated by three GPT models (3.5-Turbo, 4-4o) for 5K+ questions; the dataset also includes similarity metrics (e.g., ROUGE, SacreBLEU, BERTScore) of the LLM-based answers compared to the human-accepted ones.
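
To give a sense of what a similarity metric between an LLM-generated answer and a human-accepted answer measures, here is a minimal sketch of a simplified ROUGE-1 style score (unigram-overlap F1) in plain Python. It is an illustration only, not the dataset authors' evaluation pipeline: the function name `rouge1_f1`, the whitespace tokenization, and the two example sentences are assumptions made for this sketch, and real ROUGE implementations add stemming and other variants.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (a simplified ROUGE-1) between two texts."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example texts, not taken from the dataset.
accepted_answer = "use a salted hash to store passwords"
llm_answer = "store passwords with a salted hash"
print(round(rouge1_f1(llm_answer, accepted_answer), 3))
```

A score near 1 means the generated answer shares most of its words with the accepted answer; it says nothing about factual correctness, which is why embedding-based metrics such as BERTScore are typically reported alongside overlap-based ones.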

Created:

2024-12-02
