Download link:
https://modelscope.cn/datasets/QCRI/AZERG-Dataset
# AZERG-Dataset
This repository contains the AZERG-Dataset, a comprehensive collection of annotated cyber threat intelligence (CTI) reports designed for training and evaluating models on STIX entity and relationship extraction.
This dataset was created for the paper: "From Text to Actionable Intelligence: Automating STIX Entity and Relationship Extraction". It is the largest publicly available dataset of its kind, meticulously annotated with STIX-compliant entities and relationships to facilitate the development of automated threat intelligence tools.
## 📖 Dataset Overview
The AZERG-Dataset is built from 141 real-world threat analysis reports and contains 4,011 STIX entities and 2,075 STIX relationships. It was curated to address the lack of training data for automated STIX report generation and supports a multi-task approach to threat intelligence extraction.
The extraction process is divided into four sequential subtasks:
- T1 (Entity Detection): Identifying all STIX entities (SDOs and SCOs) in a text passage.
- T2 (Entity Type Identification): Assigning a specific STIX type to each detected entity.
- T3 (Related Pair Detection): Identifying which pairs of entities are semantically related based on the text.
- T4 (Relationship Type Identification): Determining the precise STIX relationship type (e.g., uses, targets) between a related pair of entities.
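The four subtasks can be read as a chained pipeline, sketched below in Python. This only illustrates the task decomposition and is not the authors' implementation: the `Entity` and `Relationship` classes, the `extract` function, and the four callables passed to it are hypothetical names introduced here, and the toy rule-based stand-ins exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Entity:
    text: str        # surface form found in the passage (T1)
    stix_type: str   # STIX type assigned in T2, e.g. "malware"


@dataclass(frozen=True)
class Relationship:
    source: Entity
    target: Entity
    rel_type: str    # STIX relationship type assigned in T4, e.g. "uses"


def extract(
    passage: str,
    detect: Callable[[str], List[str]],               # T1: entity mentions
    classify: Callable[[str, str], str],              # T2: mention -> type
    related: Callable[[str, Entity, Entity], bool],   # T3: is the pair related?
    label: Callable[[str, Entity, Entity], str],      # T4: relationship type
) -> Tuple[List[Entity], List[Relationship]]:
    """Chain the four subtasks; each callable stands in for a model."""
    entities = [Entity(m, classify(passage, m)) for m in detect(passage)]
    # Simplification: each unordered pair is considered once, in detection
    # order. A real system would also have to decide the direction.
    relationships = [
        Relationship(a, b, label(passage, a, b))
        for a, b in combinations(entities, 2)
        if related(passage, a, b)
    ]
    return entities, relationships


# Toy stand-ins (assumptions, for illustration only).
KNOWN = {"APT28": "intrusion-set", "X-Agent": "malware"}
passage = "APT28 uses X-Agent against its targets."
entities, relationships = extract(
    passage,
    detect=lambda p: [name for name in KNOWN if name in p],
    classify=lambda p, m: KNOWN[m],
    related=lambda p, a, b: True,
    label=lambda p, a, b: "uses",
)
```

Keeping each stage behind its own callable mirrors the per-task files in the dataset: a model for one subtask can be swapped or evaluated without touching the others.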
## 📂 Dataset Structure
The dataset is organized into train and test splits. The training and testing data are sourced from completely non-overlapping reports and vendors to ensure a robust evaluation of model generalization.
```
AZERG-Dataset/
├── train/
│   ├── azerg_T1_train.json
│   ├── azerg_T2_train.json
│   ├── azerg_T3_train.json
│   ├── azerg_T4_train.json
│   └── azerg_MixTask_train.json  # Combined data for all tasks
└── test/
    ├── annoctr_T1_test.json
    ├── annoctr_T2_test.json
    ├── annoctr_T3_test.json
    ├── annoctr_T4_test.json
    ├── azerg_T1_test.json
    ├── azerg_T2_test.json
    ├── azerg_T3_test.json
    └── azerg_T4_test.json
```
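As a rough sketch, the files in this layout can be indexed and loaded with the Python standard library. It assumes the dataset has been downloaded to a local directory and that each file parses as JSON. The record schema is not described here, so the code only loads the data without interpreting it. The helper names `index_files` and `load_task` are hypothetical.

```python
import json
import re
from pathlib import Path
from typing import Any, Dict, Tuple

# File names follow <source>_<task>_<split>.json, e.g. azerg_T1_train.json.
NAME_RE = re.compile(
    r"^(?P<source>[a-z]+)_(?P<task>T[1-4]|MixTask)_(?P<split>train|test)\.json$"
)


def index_files(root: Path) -> Dict[Tuple[str, str, str], Path]:
    """Map (source, task, split) to the matching file under root."""
    index: Dict[Tuple[str, str, str], Path] = {}
    for path in sorted(root.glob("*/*.json")):
        match = NAME_RE.match(path.name)
        if match:  # silently skip files that do not follow the naming scheme
            index[(match["source"], match["task"], match["split"])] = path
    return index


def load_task(root: Path, source: str, task: str, split: str) -> Any:
    """Load one file; the structure of the returned object is not assumed."""
    path = index_files(root)[(source, task, split)]
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_task(Path("AZERG-Dataset"), "azerg", "T1", "train")` would return the parsed contents of `train/azerg_T1_train.json`, assuming the dataset sits in the current working directory.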
## 📜 Citation
If you use this dataset in your research, please cite the original paper (currently available as an arXiv preprint; it has been accepted at RAID 2025):
```
@article{lekssays2025azerg,
title={From Text to Actionable Intelligence: Automating STIX Entity and Relationship Extraction},
author={Lekssays, Ahmed and Sencar, Husrev Taha and Yu, Ting},
journal={arXiv preprint arXiv:2507.16576},
year={2025}
}
```
Provider: maas
Created: 2025-07-23



