theResearchNinja/OllaBench

Name: theResearchNinja/OllaBench
Creator: theResearchNinja
Published: 2024-06-12 02:36:11
License: CC BY 4.0

Source: Hugging Face. Updated 2024-06-12; indexed 2024-06-29.

Download link:

https://hf-mirror.com/datasets/theResearchNinja/OllaBench


Resource description:

---
license: cc-by-4.0
task_categories:
- question-answering
language:
- en
tags:
- cybersecurity
- cognitive behavioral psychology
- benchmark
- evaluation
- cognitive behavioral reasoning
- cybersecurity compliance
- cybersecurity noncompliance
- cybersecurity reasoning
pretty_name: OllaBench
size_categories:
- 10K<n<100K
---

# Dataset Card for OllaBench

Large Language Models (LLMs) have the potential to enhance Agent-Based Modeling by better representing complex interdependent cybersecurity systems, improving cybersecurity threat modeling and risk management. Evaluating LLMs in this context is crucial for legal compliance and effective application development. Existing LLM evaluation frameworks often overlook the human factor and the cognitive computing capabilities essential for interdependent cybersecurity. To address this gap, I propose OllaBench, a novel evaluation framework that assesses LLMs' accuracy, wastefulness, and consistency in answering scenario-based information security compliance and non-compliance questions.

## Dataset Details

- **Curated by:** tom.nguyen@ieee.org
- **Language(s) (NLP):** English
- **License:** CC BY 4.0

A sample scenario is as follows.

Avery Johnson: The individual values following the organization's Information Security Policies. The individual expressed their plan to safeguard their company's data and technological assets. The individual has the ability to safeguard themselves against cyber breaches. I prioritize protecting my employer's sensitive information by following all security guidelines. Viewing compliance with the institution's security policies as vital.

Emily Carter: The person understood the benefits of following the information security policies. The person views compliance with the institution's security policies as a mandatory requirement. A person has complete power over adhering to information security policies. The individual values following the institution's guidelines for Information Security Policies. The individual understands the consequences for violating the organization's information security policy.

As shown, each scenario begins with brief cognitive behavioral profiles of two people. The LLMs are then required to answer the following questions:

Which of the following options best reflects Avery Johnson's or Emily Carter's cognitive behavioral constructs?

- (option a) - ['Knowledge', 'Social', 'Motivation', 'Attitude', 'Intent']
- (option b) - ['Self-efficacy', 'Motivation', 'Intent', 'Subjective norms', 'Attitude']
- (option c) - ['Attitude', 'Intent', 'Control', 'Motivation', 'Attitude']
- (option d) - ['Control', 'Attitude', 'Response Efficacy', 'Benefits', 'Intent']

Who is LESS compliant with information security policies?

- (option a) - Avery Johnson
- (option b) - They carry the same risk level
- (option c) - Emily Carter
- (option d) - It is impossible to tell

Will information security non-compliance risk level increase if these employees work closely in the same team?

- (option a) - security non-compliance risk level may increase
- (option b) - security non-compliance risk level will increase
- (option c) - security non-compliance risk level will stay the same
- (option d) - It is impossible to tell

To increase information security compliance, which cognitive behavioral factor should be targeted for strengthening?

- (option a) - Attitude
- (option b) - Motivation
- (option c) - Knowledge
- (option d) - Intent

### Dataset Sources

OllaBench is built on a foundation of 24 cognitive behavioral theories and empirical evidence from 38 peer-reviewed papers. Please check out the OllaBench white paper below for the complete science behind the dataset.

- **Repository:** https://github.com/Cybonto/OllaBench
- **Paper:** https://arxiv.org/abs/2406.06863

## Uses

The first question is of the "Which Cognitive Path" (WCP) type. The second is of the "Who is Who" (WHO) type. The third is of the "Team Risk Analysis" type, and the last is of the "Target Factor Analysis" type.

OllaBench then uses the generated scenarios and questions to query the evaluatee models hosted in Ollama. The Average score is the average of each model's 'Avg WCP score', 'Avg WHO score', 'Avg Team Risk score', and 'Avg Target Factor score'. The model with the highest Average score could be the best-performing model. However, it is not necessarily the most efficient model, because efficiency combines many factors, including performance metrics and the wasted response metric. The Wasted Response of each response is measured by the response's token count and by whether the response was evaluated as incorrect. The Wasted Average score is the total number of wasted tokens divided by the number of wrong responses. Further resource costs in terms of time and/or money can be derived from the total wasted response value. The model with the lowest Wasted Average score can be the most efficient model (to be decided in joint consideration with other metrics). Please check the OllaBench paper for proper use.

### Out-of-Scope Use

To be added.

#### Personal and Sensitive Information

There is no personal or sensitive information in the dataset.

## Bias, Risks, and Limitations

To be added.

### Recommendations

I recommend using the OllaBench GUI application to benchmark with this dataset. The application is available on GitHub.

## Citation

**BibTeX:**

@misc{nguyen2024ollabench,
  title={Ollabench: Evaluating LLMs' Reasoning for Human-centric Interdependent Cybersecurity},
  author={Tam N. Nguyen},
  year={2024},
  eprint={2406.06863},
  archivePrefix={arXiv},
  primaryClass={cs.CR}
}

**APA:**

[More Information Needed]

## More Information

tom.nguyen@ieee.org

## Dataset Card Authors

tom.nguyen@ieee.org

## Dataset Card Contact

tom.nguyen@ieee.org


Provider:

theResearchNinja

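Summary of original information

OllaBench dataset overview

Dataset details

Name: OllaBench
Task category: question answering
Language: English
Tags:
- cybersecurity
- cognitive behavioral psychology
- benchmark
- evaluation
- cognitive behavioral reasoning
- cybersecurity compliance
- cybersecurity noncompliance
- cybersecurity reasoning
Size category: 10K<n<100K
License: CC BY 4.0
Creator: tom.nguyen@ieee.org

Dataset description

OllaBench is a framework for evaluating the accuracy, wastefulness, and consistency of large language models (LLMs) in cybersecurity compliance and non-compliance scenarios. The dataset is built on 24 cognitive behavioral theories and empirical evidence from 38 peer-reviewed papers.

Sample scenario

Each scenario contains the cognitive behavioral profiles of two individuals and requires the LLM to answer the following types of questions:

Which Cognitive Path (WCP): choose the option that best reflects an individual's cognitive behavioral constructs.
Who is Who (WHO): determine who is less compliant with information security policies.
Team Risk Analysis: assess whether information security non-compliance risk increases when the employees work in the same team.
Target Factor Analysis: identify the cognitive behavioral factor that should be strengthened to improve information security compliance.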
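
The question layout described above could be represented roughly as follows. This is a minimal sketch, not the OllaBench implementation: the `Question` class, the short type labels, and the `build_prompt` and `extract_choice` helpers are hypothetical names introduced for illustration, and the regular expression assumes the model echoes the `(option x)` wording used in the dataset card.

```python
import re
from dataclasses import dataclass


@dataclass
class Question:
    """One multiple-choice question attached to a scenario (hypothetical layout)."""
    qtype: str     # assumed labels: "WCP", "WHO", "TeamRisk", "TargetFactor"
    text: str
    options: dict  # option letter -> option text


def build_prompt(profiles: str, question: Question) -> str:
    """Join the two profiles, the question, and its options into one prompt."""
    lines = [profiles, "", question.text]
    for letter, text in question.options.items():
        lines.append(f"(option {letter}) - {text}")
    return "\n".join(lines)


def extract_choice(response: str):
    """Return the first '(option x)' letter found in a model response, or None."""
    match = re.search(r"\(option ([a-d])\)", response.lower())
    return match.group(1) if match else None


who_question = Question(
    qtype="WHO",
    text="Who is LESS compliant with information security policies?",
    options={
        "a": "Avery Johnson",
        "b": "They carry the same risk level",
        "c": "Emily Carter",
        "d": "It is impossible to tell",
    },
)
```

Keeping the option letters in a dictionary makes it straightforward to compare the extracted letter with an answer key once one is loaded from the dataset.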

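Dataset sources

OllaBench is built on 24 cognitive behavioral theories and empirical evidence from 38 peer-reviewed papers.

Uses

OllaBench evaluates how LLMs perform in cybersecurity scenarios: it generates scenarios and questions, queries the models with them, and computes an average score and a wasted-response score.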
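
The dataset card defines the Average score as the mean of the four per-type average scores, and the Wasted Average score as total wasted tokens divided by the number of wrong responses. The sketch below illustrates that arithmetic under two assumptions not stated in the card: each per-type score is taken to be plain accuracy, and a response's wasted tokens are taken to be its full token count when it is wrong. The record layout and function names are hypothetical.

```python
from statistics import mean

# Assumed short labels for the four question types.
QUESTION_TYPES = ("WCP", "WHO", "TeamRisk", "TargetFactor")


def average_score(records):
    """Mean of the per-type accuracies (assumption: per-type score = accuracy)."""
    per_type = []
    for qtype in QUESTION_TYPES:
        hits = [1.0 if r["correct"] else 0.0 for r in records if r["qtype"] == qtype]
        if hits:
            per_type.append(mean(hits))
    return mean(per_type) if per_type else 0.0


def wasted_average(records):
    """Total tokens spent on wrong responses divided by the number of wrong responses."""
    wasted = [r["tokens"] for r in records if not r["correct"]]
    return sum(wasted) / len(wasted) if wasted else 0.0


# Made-up illustration data, not results from any real model.
records = [
    {"qtype": "WCP", "correct": True, "tokens": 40},
    {"qtype": "WCP", "correct": False, "tokens": 120},
    {"qtype": "WHO", "correct": True, "tokens": 30},
    {"qtype": "TeamRisk", "correct": False, "tokens": 80},
    {"qtype": "TargetFactor", "correct": True, "tokens": 50},
]
```

For these made-up records the per-type accuracies are 0.5, 1.0, 0.0, and 1.0, so `average_score(records)` is 0.625, and the two wrong responses cost 200 tokens in total, so `wasted_average(records)` is 100.0.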

Out-of-scope use

To be added

Personal and sensitive information

The dataset contains no personal or sensitive information.

Bias, risks, and limitations

To be added

Citation

@misc{nguyen2024ollabench,
  title={Ollabench: Evaluating LLMs' Reasoning for Human-centric Interdependent Cybersecurity},
  author={Tam N. Nguyen},
  year={2024},
  eprint={2406.06863},
  archivePrefix={arXiv},
  primaryClass={cs.CR}
}

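Curated summary

Dataset introduction

Construction

OllaBench sits at the intersection of cybersecurity and cognitive behavioral psychology. It is built on 24 cognitive behavioral theories and empirical evidence from 38 peer-reviewed papers, and it evaluates large language models through designed scenario simulations. Each scenario describes the cognitive behavioral traits of two fictional people and then poses four structured question types: cognitive path identification, compliance comparison, team risk analysis, and target factor strengthening.

Features

The core feature of OllaBench is that it combines cognitive behavioral psychology with cybersecurity compliance evaluation, bringing the human factor into the measurement of large language model performance. Its scenario-based question design tests a model's accuracy in information security and compliance reasoning, and it adds wasted-response and consistency metrics so that model efficiency can be quantified along several dimensions. This cross-disciplinary view provides a benchmark for how models behave in complex human-machine settings.

Usage

To use OllaBench, researchers submit the predefined scenarios and questions to the large language models under evaluation, either through the companion GUI application or through a programming interface. The evaluation scores all four question types and combines average performance with the wasted-response metric to balance model effectiveness against resource consumption. The framework supports standardized comparison of models' cybersecurity compliance reasoning and offers empirical grounds for choosing models for human-machine collaborative security systems. See the white paper and the code repository for operational details.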
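
Because the dataset card states that the evaluated models are hosted in Ollama, submitting one prompt programmatically might look like the sketch below. It is not taken from the OllaBench code base: it assumes a local Ollama server on its default port and uses Ollama's `/api/generate` endpoint with streaming disabled. The helper names and the model name in the usage note are placeholders.

```python
import json
import urllib.request

# Assumed default address of a locally running Ollama server; adjust as needed.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")


def ask(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> dict:
    """Send one prompt and return the response text and its generated-token count."""
    request = urllib.request.Request(
        url,
        data=build_payload(model, prompt),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # "response" holds the generated text; "eval_count" is the generated-token count.
    return {"text": body.get("response", ""), "tokens": body.get("eval_count", 0)}
```

A call such as `ask("llama3", prompt)` requires a running server with that model pulled; the returned token count is the kind of value a wasted-response calculation would need for each incorrect answer.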

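Background and challenges

Background overview

At the intersection of cybersecurity and cognitive behavioral psychology, the rise of large language models (LLMs) opens new possibilities for agent-based modeling: LLMs can represent complex interdependent cybersecurity systems more precisely and thereby improve threat modeling and risk management. To fill the gap that existing evaluation frameworks leave around the human factor and cognitive computing capabilities, researcher Tam N. Nguyen proposed OllaBench in 2024. The framework uses scenario-based information security compliance and non-compliance questions to measure LLMs' accuracy, wastefulness, and consistency. The dataset is built on 24 cognitive behavioral theories and empirical evidence from 38 peer-reviewed papers, and it aims to advance the use of LLMs in cybersecurity compliance reasoning in support of legal compliance and effective application development.

Current challenges

OllaBench addresses the problem of evaluating reasoning about human cognitive behavior in cybersecurity compliance. Its core challenge is to turn abstract cognitive behavioral theories into quantifiable multiple-choice questions while ensuring that the scenarios accurately reflect subtle differences in employees' compliance tendencies. Building the dataset requires integrating cross-disciplinary knowledge and balancing concepts from cognitive psychology against cybersecurity practice without introducing subjective bias. The evaluation metrics must also weigh performance against resource consumption, for example by using the wasted-response measure to capture the tokens a model spends on incorrect answers, which places high demands on the rigor of data annotation and evaluation methods.

Common scenarios

Typical use

OllaBench provides a standardized benchmark for evaluating how large language models reason in complex human-machine scenarios. It simulates information security compliance and non-compliance situations grounded in cognitive behavioral theory and requires models to analyze individual behavioral traits and predict team risk dynamics, testing how accurately and consistently a model understands the influence of human cognitive factors on cybersecurity decisions. This use deepens the treatment of human factors in agent-based modeling and sets a rigorous evaluation framework for a model's logical coherence in dynamic risk assessment.

Derived work

Work derived from OllaBench has mainly extended its evaluation dimensions and transferred it to other domains. For example, follow-up studies have combined textual cognitive behavioral descriptions with biometric signals through multimodal data fusion to improve models' sensitivity to latent risk factors. Some researchers have also borrowed the dataset's team risk analysis framework to develop adapted evaluation tools for fields such as medical ethics and financial compliance, further testing how well cognitive behavioral reasoning models generalize across complex sociotechnical systems.

Recent research on the dataset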