
AZERG-Dataset

ModelScope community, updated 2025-12-05, indexed 2025-09-06
Download link:
https://modelscope.cn/datasets/QCRI/AZERG-Dataset
Resource description:
# AZERG-Dataset

This repository contains the AZERG-Dataset, a comprehensive collection of annotated cyber threat intelligence (CTI) reports designed for training and evaluating models on STIX entity and relationship extraction.

This dataset was created for the paper: "From Text to Actionable Intelligence: Automating STIX Entity and Relationship Extraction". It is the largest publicly available dataset of its kind, meticulously annotated with STIX-compliant entities and relationships to facilitate the development of automated threat intelligence tools.

# 📖 Dataset Overview

The AZERG-Dataset is built from 141 real-world threat analysis reports and contains 4,011 STIX entities and 2,075 STIX relationships. It was curated to address the lack of training data for automated STIX report generation and supports a multi-task approach to threat intelligence extraction.

The extraction process is divided into four sequential subtasks:

- T1: Entity Detection: Identifying all STIX entities (SDOs and SCOs) in a text passage.
- T2: Entity Type Identification: Assigning a specific STIX type to each detected entity.
- T3: Related Pair Detection: Identifying which pairs of entities are semantically related based on the text.
- T4: Relationship Type Identification: Determining the precise STIX relationship type (e.g., uses, targets) between a related pair of entities.

## 📂 Dataset Structure

The dataset is organized into train and test splits. The training and testing data are sourced from completely non-overlapping reports and vendors to ensure a robust evaluation of model generalization.

```
AZERG-Dataset/
├── train/
│   ├── azerg_T1_train.json
│   ├── azerg_T2_train.json
│   ├── azerg_T3_train.json
│   ├── azerg_T4_train.json
│   └── azerg_MixTask_train.json  # Combined data for all tasks
└── test/
    ├── annoctr_T1_test.json
    ├── annoctr_T2_test.json
    ├── annoctr_T3_test.json
    ├── annoctr_T4_test.json
    ├── azerg_T1_test.json
    ├── azerg_T2_test.json
    ├── azerg_T3_test.json
    └── azerg_T4_test.json
```

## 📜 Citation

If you use this dataset in your research, please cite the original paper (ArXiv for now, the paper is accepted at RAID 2025):

```
@article{lekssays2025azerg,
  title={From Text to Actionable Intelligence: Automating STIX Entity and Relationship Extraction},
  author={Lekssays, Ahmed and Sencar, Husrev Taha and Yu, Ting},
  journal={arXiv preprint arXiv:2507.16576},
  year={2025}
}
```
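## Loading a task file (illustrative sketch)

The file names under Dataset Structure follow a regular pattern, `<split>/<source>_<task>_<split>.json`, so a per-task file can be located programmatically. The sketch below is not part of the original repository: the helper names `task_file` and `load_task` are hypothetical, the local root directory `AZERG-Dataset` is assumed to be a downloaded copy, and the record schema inside each JSON file is not documented here, so the parsed value is returned as-is rather than interpreted.

```python
import json
from pathlib import Path

# File-name pattern inferred from the directory tree in this README:
#   <split>/<source>_<task>_<split>.json
# SOURCES lists which file prefixes appear in each split of that tree.
SOURCES = {"train": ("azerg",), "test": ("annoctr", "azerg")}
TASKS = ("T1", "T2", "T3", "T4")


def task_file(root, source, task, split):
    """Build the path of one task file, e.g. test/annoctr_T3_test.json.

    Hypothetical helper: it only mirrors the listed file names and does
    not check that the file exists on disk.
    """
    if split not in SOURCES or source not in SOURCES[split]:
        raise ValueError(f"no {source!r} files listed for split {split!r}")
    # The combined MixTask file is listed only under train/.
    if task not in TASKS and not (task == "MixTask" and split == "train"):
        raise ValueError(f"unknown task {task!r} for split {split!r}")
    return Path(root) / split / f"{source}_{task}_{split}.json"


def load_task(root, source, task, split):
    """Parse one task file and return the JSON value unchanged.

    Assumption: the files are UTF-8 JSON. The structure of the records
    is not described in this README, so nothing is assumed about it.
    """
    with open(task_file(root, source, task, split), encoding="utf-8") as fh:
        return json.load(fh)
```

With a local copy in place, `load_task("AZERG-Dataset", "azerg", "T1", "train")` would read the T1 training file; inspecting the returned value is the safest way to learn the actual record layout before writing task-specific code.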

Provider:
maas
Created:
2025-07-23