AZERG-Dataset

Name: AZERG-Dataset
Creator: maas
Published: 2025-12-05 16:43:03
License: No description available

ModelScope community; updated 2025-12-05, indexed 2025-09-06

Download link:

https://modelscope.cn/datasets/QCRI/AZERG-Dataset

Resource description:

# AZERG-Dataset

This repository contains the AZERG-Dataset, a comprehensive collection of annotated cyber threat intelligence (CTI) reports designed for training and evaluating models on STIX entity and relationship extraction. This dataset was created for the paper "From Text to Actionable Intelligence: Automating STIX Entity and Relationship Extraction".

It is the largest publicly available dataset of its kind, meticulously annotated with STIX-compliant entities and relationships to facilitate the development of automated threat intelligence tools.

# 📖 Dataset Overview

The AZERG-Dataset is built from 141 real-world threat analysis reports and contains 4,011 STIX entities and 2,075 STIX relationships. It was curated to address the lack of training data for automated STIX report generation and supports a multi-task approach to threat intelligence extraction.

The extraction process is divided into four sequential subtasks:

- T1: Entity Detection: Identifying all STIX entities (SDOs and SCOs) in a text passage.
- T2: Entity Type Identification: Assigning a specific STIX type to each detected entity.
- T3: Related Pair Detection: Identifying which pairs of entities are semantically related based on the text.
- T4: Relationship Type Identification: Determining the precise STIX relationship type (e.g., uses, targets) between a related pair of entities.

## 📂 Dataset Structure

The dataset is organized into train and test splits. The training and testing data are sourced from completely non-overlapping reports and vendors to ensure a robust evaluation of model generalization.

```
AZERG-Dataset/
├── train/
│   ├── azerg_T1_train.json
│   ├── azerg_T2_train.json
│   ├── azerg_T3_train.json
│   ├── azerg_T4_train.json
│   └── azerg_MixTask_train.json  # Combined data for all tasks
└── test/
    ├── annoctr_T1_test.json
    ├── annoctr_T2_test.json
    ├── annoctr_T3_test.json
    ├── annoctr_T4_test.json
    ├── azerg_T1_test.json
    ├── azerg_T2_test.json
    ├── azerg_T3_test.json
    └── azerg_T4_test.json
```

## 📜 Citation

If you use this dataset in your research, please cite the original paper (currently available on arXiv; the paper has been accepted at RAID 2025):

```
@article{lekssays2025azerg,
  title={From Text to Actionable Intelligence: Automating STIX Entity and Relationship Extraction},
  author={Lekssays, Ahmed and Sencar, Husrev Taha and Yu, Ting},
  journal={arXiv preprint arXiv:2507.16576},
  year={2025}
}
```
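As a rough illustration of how the train and test files listed under Dataset Structure might be enumerated and inspected locally, here is a minimal Python sketch. It assumes the repository has already been downloaded to a local folder. The helper names (`expected_files`, `load_split`, `summarize`) are hypothetical and not part of the dataset. The record schema inside each JSON file is not described on this page, so the sketch only parses the files and counts top-level items.

```python
import json
from pathlib import Path

# The four subtasks named in the dataset overview.
TASKS = ("T1", "T2", "T3", "T4")


def expected_files(root):
    """Build the file paths shown in the directory tree, grouped by split."""
    root = Path(root)
    train = [root / "train" / f"azerg_{task}_train.json" for task in TASKS]
    train.append(root / "train" / "azerg_MixTask_train.json")
    test = [
        root / "test" / f"{source}_{task}_test.json"
        for source in ("annoctr", "azerg")
        for task in TASKS
    ]
    return {"train": train, "test": test}


def load_split(path):
    """Parse one JSON file and return it unchanged (schema is an unknown here)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(root):
    """Map each file found on disk to its number of top-level items."""
    summary = {}
    for paths in expected_files(root).values():
        for path in paths:
            if path.exists():
                data = load_split(path)
                # Lists and dicts have a length; anything else counts as one item.
                summary[path.name] = len(data) if isinstance(data, (list, dict)) else 1
    return summary


if __name__ == "__main__":
    # "AZERG-Dataset" is an assumed local download location.
    for name, count in summarize("AZERG-Dataset").items():
        print(f"{name}: {count} top-level items")
```

Files that are missing on disk are skipped rather than raising an error, so the same script can be run against a partial download.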

Provider:

maas

Created:

2025-07-23
