thangvip/cti-dataset-split

Name: thangvip/cti-dataset-split
Creator: thangvip
Published: 2023-11-22 08:59:44
License: not specified

Source: Hugging Face. Updated 2023-11-22; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/thangvip/cti-dataset-split

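Description:

The dataset has three configurations: the default configuration (default), subset 1 (subset1), and subset 2 (subset2). Each configuration has four features: sentence_idx (sentence index), words (word sequence), POS (part-of-speech tag sequence), and tag (label sequence). The dataset card provides mapping tables between the POS tags, the labels, and their integer IDs, which is useful for natural language processing and text analysis. Each configuration also lists the size and number of examples of its train or test split, along with the download size.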

Provider:

thangvip

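Summary of original information

Dataset overview

Dataset configurations

Default configuration (`default`)

Features:
- sentence_idx: integer (int64)
- words: sequence of strings
- POS: sequence of integers (int64)
- tag: sequence of integers (int64)
Splits:
- train:
  - Bytes: 16917605
  - Examples: 17480
Download size: 2164774 bytes
Dataset size: 16917605 bytes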
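
To illustrate how the four features fit together, here is a minimal sketch of one record as a plain Python structure. The sample sentence and its ID values are hypothetical and made up for illustration; only the field names and types come from the feature list above. The check assumes that words, POS, and tag are aligned token by token, which is the usual convention for sequence-labeling data but is not stated explicitly on the card.

```python
from typing import List, TypedDict


class CtiRecord(TypedDict):
    """One row of the dataset, following the feature list on the card."""
    sentence_idx: int
    words: List[str]
    POS: List[int]
    tag: List[int]


def is_aligned(record: CtiRecord) -> bool:
    """Return True if words, POS, and tag all have the same length."""
    n = len(record["words"])
    return len(record["POS"]) == n and len(record["tag"]) == n


# Hypothetical record, not taken from the dataset.
sample: CtiRecord = {
    "sentence_idx": 0,
    "words": ["Microsoft", "Windows", "10", "is", "affected", "."],
    "POS": [18, 18, 8, 35, 33, 5],
    "tag": [13, 8, 14, 23, 23, 23],
}

print(is_aligned(sample))
```

A check like this can be run over every row before training to catch truncated or misaligned sequences early.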

Subset 1 (`subset1`)

Features:
- sentence_idx: integer (int64)
- words: sequence of strings
- POS: sequence of integers (int64)
- tag: sequence of integers (int64)
Splits:
- train:
  - Bytes: 13350196.989130436
  - Examples: 13794
Download size: 2008529 bytes
Dataset size: 13350196.989130436 bytes

Subset 2 (`subset2`)

Features:
- sentence_idx: integer (int64)
- words: sequence of strings
- POS: sequence of integers (int64)
- tag: sequence of integers (int64)
Splits:
- test:
  - Bytes: 3338033.1604691073
  - Examples: 3449
Download size: 502967 bytes
Dataset size: 3338033.1604691073 bytes
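
The fractional byte counts for subset1 and subset2 look odd at first. The sketch below checks one possible explanation: that they equal the default configuration's average bytes per example multiplied by each subset's example count. That reading is an inference from the numbers, not something the card states. The script also reports how the two subsets divide between them and how many default examples fall outside both.

```python
import math

# Figures copied from the configuration listing.
DEFAULT_BYTES = 16917605
DEFAULT_EXAMPLES = 17480
SUBSETS = {
    "subset1": {"examples": 13794, "bytes": 13350196.989130436},
    "subset2": {"examples": 3449, "bytes": 3338033.1604691073},
}

# Average size of one example in the default configuration.
bytes_per_example = DEFAULT_BYTES / DEFAULT_EXAMPLES

for name, info in SUBSETS.items():
    estimate = bytes_per_example * info["examples"]
    matches = math.isclose(estimate, info["bytes"], rel_tol=1e-9)
    print(f"{name}: estimated {estimate:.3f} bytes, listed {info['bytes']:.3f}, match={matches}")

# How the two subsets relate to each other and to the default configuration.
total = sum(info["examples"] for info in SUBSETS.values())
share = {name: info["examples"] / total for name, info in SUBSETS.items()}
unassigned = DEFAULT_EXAMPLES - total

print(f"subset total: {total}, share: {share}, not in either subset: {unassigned}")
```

Under this reading the listed subset sizes are proportional estimates rather than measured file sizes, and the two subsets form a roughly 80/20 train/test split of 17243 examples.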

Data file configuration

Default configuration (`default`)

Data files:
- train: data/train-*

Subset 1 (`subset1`)

Data files:
- train: subset1/train-*

Subset 2 (`subset2`)

Data files:
- test: subset2/test-*
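
As a rough sketch of how these configurations might be used, the snippet below records each configuration's split and file pattern in a small table and wraps the Hugging Face `datasets` loader around it. The helper name `load_config` and the `CONFIGS` table are illustrative names introduced here; the call assumes the `datasets` package is installed and that the repository is reachable under the ID shown in the title.

```python
# Configuration name -> (split name, data file pattern), as listed on the card.
CONFIGS = {
    "default": ("train", "data/train-*"),
    "subset1": ("train", "subset1/train-*"),
    "subset2": ("test", "subset2/test-*"),
}


def load_config(config_name: str):
    """Load the single split of one configuration (needs `datasets` and network access)."""
    # Imported lazily so the CONFIGS table can be used without the package installed.
    from datasets import load_dataset

    split, _pattern = CONFIGS[config_name]
    return load_dataset("thangvip/cti-dataset-split", config_name, split=split)
```

For example, `load_config("subset1")` would return the training split and `load_config("subset2")` the test split. Note that subset2 has no train split and subset1 has no test split, so the split name has to match the configuration.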

Dictionary mappings

POS tag mapping

POS to ID: `pos_2_id = {"#": 0, "$": 1, "''": 2, "(": 3, ")": 4, ".": 5, ":": 6, "CC": 7, "CD": 8, "DT": 9, "EX": 10, "FW": 11, "IN": 12, "JJ": 13, "JJR": 14, "JJS": 15, "MD": 16, "NN": 17, "NNP": 18, "NNPS": 19, "NNS": 20, "PDT": 21, "POS": 22, "PRP": 23, "PRP$": 24, "RB": 25, "RBR": 26, "RBS": 27, "RP": 28, "TO": 29, "VB": 30, "VBD": 31, "VBG": 32, "VBN": 33, "VBP": 34, "VBZ": 35, "WDT": 36, "WP": 37, "WP$": 38, "WRB": 39}`
ID to POS: `id_2_pos = {0: "#", 1: "$", 2: "''", 3: "(", 4: ")", 5: ".", 6: ":", 7: "CC", 8: "CD", 9: "DT", 10: "EX", 11: "FW", 12: "IN", 13: "JJ", 14: "JJR", 15: "JJS", 16: "MD", 17: "NN", 18: "NNP", 19: "NNPS", 20: "NNS", 21: "PDT", 22: "POS", 23: "PRP", 24: "PRP$", 25: "RB", 26: "RBR", 27: "RBS", 28: "RP", 29: "TO", 30: "VB", 31: "VBD", 32: "VBG", 33: "VBN", 34: "VBP", 35: "VBZ", 36: "WDT", 37: "WP", 38: "WP$", 39: "WRB"}`
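
A minimal sketch of turning the integer POS column back into readable tags follows. The tag at ID 2 is written as `''` (the Penn Treebank closing-quote tag); this is an assumption, based on the fact that the IDs follow ASCII sort order and `''` is the quote-like string that sorts between `$` and `(`. The helper name `decode_pos` is introduced here for illustration.

```python
# POS tags in ID order, so the list index is the ID.
# Index 2 is assumed to be the Penn Treebank closing-quote tag "''".
POS_TAGS = [
    "#", "$", "''", "(", ")", ".", ":", "CC", "CD", "DT",
    "EX", "FW", "IN", "JJ", "JJR", "JJS", "MD", "NN", "NNP", "NNPS",
    "NNS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "TO",
    "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB",
]

pos_2_id = {tag: idx for idx, tag in enumerate(POS_TAGS)}
id_2_pos = {idx: tag for tag, idx in pos_2_id.items()}


def decode_pos(ids):
    """Map a sequence of POS IDs to their tag strings."""
    return [id_2_pos[i] for i in ids]


print(decode_pos([9, 17, 35]))
```

Building both dictionaries from one ordered list keeps the two directions of the mapping from drifting apart.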

Label mapping

Label to ID: `tag_2_id = {"B-application": 0, "B-cve id": 1, "B-edition": 2, "B-file": 3, "B-function": 4, "B-hardware": 5, "B-language": 6, "B-method": 7, "B-os": 8, "B-parameter": 9, "B-programming language": 10, "B-relevant_term": 11, "B-update": 12, "B-vendor": 13, "B-version": 14, "I-application": 15, "I-edition": 16, "I-hardware": 17, "I-os": 18, "I-relevant_term": 19, "I-update": 20, "I-vendor": 21, "I-version": 22, "O": 23}`
ID to label: `id_2_tag = {0: "B-application", 1: "B-cve id", 2: "B-edition", 3: "B-file", 4: "B-function", 5: "B-hardware", 6: "B-language", 7: "B-method", 8: "B-os", 9: "B-parameter", 10: "B-programming language", 11: "B-relevant_term", 12: "B-update", 13: "B-vendor", 14: "B-version", 15: "I-application", 16: "I-edition", 17: "I-hardware", 18: "I-os", 19: "I-relevant_term", 20: "I-update", 21: "I-vendor", 22: "I-version", 23: "O"}`
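
The labels follow a B-/I-/O naming pattern, so one natural use is to group tokens into entity spans. The sketch below does that under the usual BIO reading (B- starts an entity, I- continues one of the same type, O is outside), which the card does not spell out. The function name `extract_entities` and the sample sentence are hypothetical. Seven types (cve id, file, function, language, method, parameter, programming language) have a B- label but no I- label in the mapping, so they can only ever form single-token spans.

```python
from typing import List, Tuple

# Labels in ID order, so the list index is the ID.
TAGS = [
    "B-application", "B-cve id", "B-edition", "B-file", "B-function",
    "B-hardware", "B-language", "B-method", "B-os", "B-parameter",
    "B-programming language", "B-relevant_term", "B-update", "B-vendor",
    "B-version", "I-application", "I-edition", "I-hardware", "I-os",
    "I-relevant_term", "I-update", "I-vendor", "I-version", "O",
]
id_2_tag = dict(enumerate(TAGS))


def extract_entities(words: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Group tokens into (entity_type, text) spans using the BIO convention."""
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_words: List[str] = []
    for word, tag_id in zip(words, tag_ids):
        prefix, _, entity_type = id_2_tag[tag_id].partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        if current_type is not None:
            # Any other label closes the open entity.
            entities.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            # A stray I- label with no matching open entity is treated as a start.
            current_type, current_words = entity_type, [word]
        else:
            current_type, current_words = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities


# Hypothetical sentence, not taken from the dataset.
words = ["Adobe", "Flash", "Player", "before", "11.2", "on", "Linux"]
tag_ids = [13, 0, 15, 23, 14, 23, 8]
print(extract_entities(words, tag_ids))
```

Treating a stray I- label as the start of a new span is a lenient design choice; a stricter variant could drop such tokens or raise an error instead.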
