Multilingual-Multimodal-NLP/SEVENLLM-Dataset

Name: Multilingual-Multimodal-NLP/SEVENLLM-Dataset
Creator: Multilingual-Multimodal-NLP
Published: 2024-05-09 06:40:08
License: no description on the listing page (the dataset card declares apache-2.0)

Source: Hugging Face. Updated: 2024-05-09. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/Multilingual-Multimodal-NLP/SEVENLLM-Dataset

Resource description:

---
license: apache-2.0
language:
- en
- zh
---

# Introduction

We provide a dataset designed for analyzing cybersecurity incidents. It comprises two primary task categories, understanding and generation, which are further divided into 28 subcategories of tasks. The dataset is in question-and-answer format, using structured JSON for understanding tasks and unstructured text for generation tasks. We also provide some multiple-choice questions to test the model's cognitive ability in different vertical fields.

**Please note that these data contain sensitive words in the field of network security, so they may trigger the protection mechanisms of some terminals.**

# Dataset Structure

## Data Instances

Each data point comprises a context, a question about the context, and an answer to the question. In addition, each data point records the task type and the thinking process for answering the task. An example from the dataset looks like the following:

```json
{
  "category": "...",
  "instruction": "...",
  "input": "...",
  "thought": "...",
  "output": "..."
}
```

## Data Fields

**category:** The subtask type to which the sample belongs.

**instruction:** An instruction question for this subtask.

**input:** Original corpus of network security incidents.

**thought:** A reference thinking process based on the original corpus and the question.

**output:** The answer generated from the question and the original corpus.

## Data Splits

| Type | Filename | Sample Size |
|-----------|-----------|-----------|
| SEVENLLM-Instruct | train.jsonl | 91401 |
| SEVENLLM-Bench | test.json | 1300 |

# Further Information and Resources

For more detailed information, please refer to our [published paper](https://arxiv.org/abs/2405.03446). The source code is available in our [GitHub repository](https://github.com/CSJianYang/SEevenLLM).

We appreciate your interest and support. Feel free to contact us with any questions or to discuss cooperation.

Email: jhy_1@buaa.edu.cn

Provided by:

Multilingual-Multimodal-NLP

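Summary of original information

Dataset overview

Dataset purpose

This dataset is designed for analyzing cybersecurity incidents. It contains two major task categories, understanding and generation, which are further divided into 28 subtask categories.

Dataset format

Understanding tasks use a structured JSON format.
Generation tasks use an unstructured text format.
Multiple-choice questions are provided to test the model's cognitive ability in different vertical fields.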
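
Because answers may be either structured JSON or free text, downstream code usually needs to tell the two apart. The following is a minimal sketch, assuming the answer arrives either as a JSON-encoded string or as an already-parsed object; the function name `classify_output` is hypothetical and the listing does not specify how answers are actually serialized.

```python
import json


def classify_output(output):
    """Return ("structured", value) for JSON objects/arrays, else ("text", output).

    Assumption: understanding-task answers are JSON objects or arrays,
    and anything else is a free-text generation answer.
    """
    # Already-parsed values (e.g. from a JSON loader) are structured as-is.
    if isinstance(output, (dict, list)):
        return ("structured", output)
    try:
        parsed = json.loads(output)
    except (TypeError, ValueError):
        # Not valid JSON (or not a string at all): treat as free text.
        return ("text", output)
    # Bare JSON scalars such as "42" are treated as text, not structure.
    if isinstance(parsed, (dict, list)):
        return ("structured", parsed)
    return ("text", output)


print(classify_output('{"key": "..."}'))
print(classify_output("A free-text summary of the incident."))
```

Trying to parse first and falling back to text avoids hard-coding which of the 28 subtask categories belong to which task family.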

Dataset content sensitivity

The data contains sensitive cybersecurity terms and may trigger the protection mechanisms of some terminals.

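Dataset structure

Data instances

Each data point includes:

A context
A question about the context
An answer to the question
The task type
The thinking process for solving the task

Data fields

category: the subtask type to which the sample belongs.
instruction: the instruction question for this subtask.
input: the original corpus of cybersecurity incidents.
thought: the thinking process based on the original corpus and the question.
output: the answer to the question based on the original corpus.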
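
The five fields listed above can be checked mechanically before a record is used. This is a minimal sketch, assuming each record is a flat JSON object; the helper name `missing_fields` is hypothetical, and the sample line reuses the placeholder values from the dataset card rather than real data.

```python
import json

# Field names as documented for this dataset.
EXPECTED_FIELDS = ("category", "instruction", "input", "thought", "output")


def missing_fields(record):
    """Return the documented field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Placeholder record mirroring the example on the dataset card.
sample_line = (
    '{"category": "...", "instruction": "...", "input": "...", '
    '"thought": "...", "output": "..."}'
)
record = json.loads(sample_line)
print(missing_fields(record))  # prints [] because all five fields are present
```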

Data splits

SEVENLLM-Instruct (train.jsonl): 91401 samples.
SEVENLLM-Bench (test.json): 1300 samples.
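
The two splits use different file extensions, so they likely need different readers. The sketch below assumes that `train.jsonl` holds one JSON object per line (the usual JSON Lines convention) and that `test.json` is a single JSON document; the listing does not confirm either layout, and the helper names `read_jsonl` and `read_json` are hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file: one JSON value per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records


def read_json(path):
    """Read a file that contains a single JSON document."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

After downloading the files, comparing `len(read_jsonl("train.jsonl"))` with the listed 91401 samples is a quick integrity check, provided the file layout matches the assumptions above.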

Contact

For questions or cooperation inquiries, please contact: jhy_1@buaa.edu.cn
