
Multilingual-Multimodal-NLP/SEVENLLM-Dataset

Source: Hugging Face. Updated 2024-05-09; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/Multilingual-Multimodal-NLP/SEVENLLM-Dataset
Resource description:
---
license: apache-2.0
language:
- en
- zh
---

# Introduction

We provide the SEVENLLM dataset, designed for analyzing cybersecurity incidents. It comprises two primary task categories, understanding and generation, which are further broken down into 28 subcategories of tasks. The dataset is in question-and-answer format, using structured JSON for understanding tasks and unstructured text for generation tasks. We also provide some multiple-choice questions to test the cognitive ability of the model in different vertical fields.

**Please note that these data contain sensitive words in the field of network security, so they may trigger the protection mechanisms of some terminals.**

# Dataset Structure

## Data Instances

Each data point comprises a context, a question about the context, and an answer to the question. Each record also includes the task type and a reasoning process for answering the task. An example from the dataset looks like the following:

```json
{
  "category": "...",
  "instruction": "...",
  "input": "...",
  "thought": "...",
  "output": "..."
}
```

## Data Fields

**category:** The subtask type to which the sample belongs.

**instruction:** An instruction question for this subtask.

**input:** Original corpus of network security incidents.

**thought:** A reasoning process, based on the original corpus and the question, that can be used as a reference.

**output:** The answer generated for the question and the original corpus.

## Data Splits

| Type | Filename | Sample Size |
|-----------|-----------|-----------|
| SEVENLLM-Instruct | train.jsonl | 91401 |
| SEVENLLM-Bench | test.json | 1300 |

# Further Information and Resources

For more detailed information, please refer to our [published paper](https://arxiv.org/abs/2405.03446). Additionally, we have made the source code available on our [GitHub repository](https://github.com/CSJianYang/SEevenLLM). We appreciate your interest and support. Feel free to contact us with any questions or proposals for cooperation.

Email: jhy_1@buaa.edu.cn
Provider:
Multilingual-Multimodal-NLP
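Summary of original information

Dataset overview

Dataset purpose

This dataset is designed for analyzing cybersecurity incidents. It covers two main task categories, understanding and generation, which are further divided into 28 subtask categories.

Dataset format

  • Understanding tasks use a structured JSON format.
  • Generation tasks use an unstructured text format.
  • Multiple-choice questions are provided to test the model's cognitive ability in different vertical fields.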
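
Because understanding tasks use structured JSON and generation tasks use free text, a consumer may want to tell the two apart before scoring an answer. The following is a minimal sketch, assuming the `output` value is either an already-parsed JSON value or a string that may contain serialized JSON; the helper name `split_output` is hypothetical and not part of the dataset or its tooling. It is only a heuristic: the `category` field remains the authoritative task label.

```python
import json


def split_output(output):
    """Classify an answer as structured JSON or free text.

    Returns ("structured", parsed_value) when the answer is a JSON object
    or array, otherwise ("text", original_value). Hypothetical helper.
    """
    # Already-parsed JSON containers count as structured.
    if isinstance(output, (dict, list)):
        return "structured", output
    try:
        value = json.loads(output)
    except (TypeError, ValueError):
        # Not a string, or a string that is not valid JSON: treat as free text.
        return "text", output
    # Bare scalars such as "42" parse as JSON but are not structured answers.
    if isinstance(value, (dict, list)):
        return "structured", value
    return "text", output
```

For example, `split_output('{"a": 1}')` yields a parsed dictionary, while a plain sentence is returned unchanged as text.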

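Content sensitivity

The data contain sensitive terms from the field of network security and may trigger the protection mechanisms of some terminals.

Dataset structure

Data instances

Each data point includes:

  • A context
  • A question about the context
  • An answer to the question
  • The task type
  • The reasoning process for answering the task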

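Data fields

  • category: The subtask type to which the sample belongs.
  • instruction: The instruction question for this subtask.
  • input: The original corpus of network security incidents.
  • thought: The reasoning process based on the original corpus and the question.
  • output: The answer to the question, based on the original corpus.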
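
The five fields above can be checked when reading a record. This is a minimal sketch, assuming each record is a single JSON object whose keys match the documented field names; the class `SevenLLMRecord` and the function `parse_record` are hypothetical names introduced for illustration, and the field types are assumptions (the dataset card does not state them).

```python
import json
from dataclasses import dataclass
from typing import Any

# Field names as listed on the dataset card.
FIELDS = ("category", "instruction", "input", "thought", "output")


@dataclass
class SevenLLMRecord:
    """Hypothetical container for one sample; types are assumptions."""
    category: str
    instruction: str
    input: str
    thought: str
    output: Any  # may be free text or a structured JSON value


def parse_record(line: str) -> SevenLLMRecord:
    """Parse one JSON object and verify that all documented fields exist."""
    obj = json.loads(line)
    missing = [name for name in FIELDS if name not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Keep only the documented fields; extra keys are ignored.
    return SevenLLMRecord(**{name: obj[name] for name in FIELDS})
```

Raising on a missing field makes schema drift visible early rather than producing partially filled records.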

Data splits

  • SEVENLLM-Instruct (train.jsonl): 91401 samples.
  • SEVENLLM-Bench (test.json): 1300 samples.
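
The two splits use different file extensions, so a loader may need to handle both JSON Lines and plain JSON. The sketch below assumes `train.jsonl` holds one JSON object per line and that `test.json` is either a top-level JSON array or, failing that, also line-delimited; the card does not specify the layout of `test.json`, so the fallback is a guess. The function names are hypothetical.

```python
import json
from pathlib import Path


def _read_json_lines(path):
    """Read one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


def load_records(path):
    """Load a split from a .jsonl or .json file into a list of dicts."""
    path = Path(path)
    if path.suffix == ".jsonl":
        return _read_json_lines(path)
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # Assumption: a .json file that is not one document is line-delimited.
        return _read_json_lines(path)
    # Wrap a single top-level object so the return type is always a list.
    return data if isinstance(data, list) else [data]
```

With local copies of the files, comparing `len(load_records("train.jsonl"))` and `len(load_records("test.json"))` against the sample sizes listed above is a quick integrity check.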

Contact

For questions or proposals for cooperation, please contact: jhy_1@buaa.edu.cn
