procit002/FirstAndLastNameOnlylistNormalizedDatasetForNER_Model_2

Name: procit002/FirstAndLastNameOnlylistNormalizedDatasetForNER_Model_2
Creator: procit002
Published: 2024-07-01 12:53:04
License: No description provided

Source: Hugging Face. Updated: 2024-07-01. Indexed: 2024-07-06.

Download link:

https://hf-mirror.com/datasets/procit002/FirstAndLastNameOnlylistNormalizedDatasetForNER_Model_2

Summary:

This dataset targets natural language processing tasks, particularly part-of-speech tagging, chunking, and named entity recognition (NER). Each example has five features: id (a string), tokens (a sequence of strings), and three integer label sequences aligned with the tokens: pos_tags (part-of-speech tags), chunk_tags (chunk tags), and ner_tags (named-entity tags). The label names for each of the three tag sets are listed under Features. The dataset is divided into training, validation, and test sets, and the number of examples and the byte size of each split are recorded.
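As a rough sketch of how the splits and features described in the summary might be loaded, assuming the Hugging Face `datasets` library is installed and the repository is still publicly available under the name shown on this page. The helper name `load_splits` is hypothetical; `load_dataset` is the library's standard entry point.

```python
REPO_ID = "procit002/FirstAndLastNameOnlylistNormalizedDatasetForNER_Model_2"


def load_splits(repo_id: str = REPO_ID):
    """Download the dataset and return it keyed by split name (needs network access)."""
    # Imported lazily so this snippet can be read and imported
    # even where the `datasets` package is not installed.
    from datasets import load_dataset

    return load_dataset(repo_id)
```

Calling `load_splits()` should return an object indexable by split name, so `ds["train"][0]["tokens"]` and `ds["train"][0]["ner_tags"]` would give the tokens and integer NER tags of the first training example, provided the feature names match those listed on this page.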

Provider:

procit002

Original information summary

Dataset overview

Features

id: string
tokens: sequence of strings
pos_tags: sequence of part-of-speech tags
- Label names:
  - 0: "
  - 1: ''
  - 2: #
  - 3: $
  - 4: (
  - 5: )
  - 6: ,
  - 7: .
  - 8: :
  - 9: ``
  - 10: CC
  - 11: CD
  - 12: DT
  - 13: EX
  - 14: FW
  - 15: IN
  - 16: JJ
  - 17: JJR
  - 18: JJS
  - 19: LS
  - 20: MD
  - 21: NN
  - 22: NNP
  - 23: NNPS
  - 24: NNS
  - 25: NN|SYM
  - 26: PDT
  - 27: POS
  - 28: PRP
  - 29: PRP$
  - 30: RB
  - 31: RBR
  - 32: RBS
  - 33: RP
  - 34: SYM
  - 35: TO
  - 36: UH
  - 37: VB
  - 38: VBD
  - 39: VBG
  - 40: VBN
  - 41: VBP
  - 42: VBZ
  - 43: WDT
  - 44: WP
  - 45: WP$
  - 46: WRB
chunk_tags: sequence of chunk tags
- Label names:
  - 0: O
  - 1: B-ADJP
  - 2: I-ADJP
  - 3: B-ADVP
  - 4: I-ADVP
  - 5: B-CONJP
  - 6: I-CONJP
  - 7: B-INTJ
  - 8: I-INTJ
  - 9: B-LST
  - 10: I-LST
  - 11: B-NP
  - 12: I-NP
  - 13: B-PP
  - 14: I-PP
  - 15: B-PRT
  - 16: I-PRT
  - 17: B-SBAR
  - 18: I-SBAR
  - 19: B-UCP
  - 20: I-UCP
  - 21: B-VP
  - 22: I-VP
ner_tags: sequence of named-entity tags
- Label names:
  - 0: O
  - 1: B-PER
  - 2: I-PER
  - 3: B-ORG
  - 4: I-ORG
  - 5: B-LOC
  - 6: I-LOC
  - 7: B-MISC
  - 8: I-MISC
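The tag features store integer ids, and the chunk and NER label sets both follow the B-/I-/O scheme. The sketch below shows one way to turn ids back into labels and group tokens into spans. The label lists are copied from this page; the function name `bio_spans` and the example row are hypothetical and not taken from the dataset.

```python
from typing import List, Tuple

# Label lists copied from this page; the list position is the integer id.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# The chunk labels are "O" followed by a B-/I- pair for each chunk type, in this order.
CHUNK_TYPES = ["ADJP", "ADVP", "CONJP", "INTJ", "LST", "NP", "PP", "PRT", "SBAR", "UCP", "VP"]
CHUNK_LABELS = ["O"] + [f"{prefix}-{t}" for t in CHUNK_TYPES for prefix in ("B", "I")]


def bio_spans(tokens: List[str], tag_ids: List[int], labels: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (type, text) spans from integer BIO tags."""
    spans: List[Tuple[str, List[str]]] = []
    prev_type = None  # type of the span currently open, if any
    for token, tag_id in zip(tokens, tag_ids):
        label = labels[tag_id]
        if label == "O":
            prev_type = None
            continue
        prefix, span_type = label.split("-", 1)
        # A B- tag always opens a new span; an I- tag opens one only
        # when no span of the same type is currently open.
        if prefix == "B" or span_type != prev_type:
            spans.append((span_type, [token]))
        else:
            spans[-1][1].append(token)
        prev_type = span_type
    return [(span_type, " ".join(words)) for span_type, words in spans]


# Hypothetical example row (not taken from the dataset).
tokens = ["Anna", "de", "Vries", "called", "Jan", "Jansen"]
ner_tags = [1, 2, 2, 0, 1, 2]
people = bio_spans(tokens, ner_tags, NER_LABELS)
print(people)
```

The same function works for chunk_tags by passing `CHUNK_LABELS` instead of `NER_LABELS`.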

Dataset splits

train:
- Number of examples: 27095
- Size in bytes: 1755770.239894889
validation:
- Number of examples: 3387
- Size in bytes: 219479.38005255544
test:
- Number of examples: 3387
- Size in bytes: 219479.38005255544
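The example counts listed for the three splits work out to roughly an 80/10/10 division. A quick arithmetic check, using only the counts given on this page:

```python
# Example counts per split, as listed on this page.
SPLIT_COUNTS = {"train": 27095, "validation": 3387, "test": 3387}

total = sum(SPLIT_COUNTS.values())
fractions = {name: count / total for name, count in SPLIT_COUNTS.items()}

for name, frac in fractions.items():
    print(f"{name:<10} {SPLIT_COUNTS[name]:>6} examples  {frac:.1%}")
```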

Dataset size

Download size: 660586 bytes
Total dataset size: 2194729.0 bytes
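The per-split byte counts on this page are fractional, which is unusual for file sizes. One plausible explanation, and this is an assumption rather than something the page states, is that they were prorated from the total dataset size by example count. The sketch below checks that hypothesis against the listed numbers and also computes the ratio of total size to download size.

```python
import math

# All figures copied from this page.
TOTAL_BYTES = 2194729.0
DOWNLOAD_BYTES = 660586
SPLIT_COUNTS = {"train": 27095, "validation": 3387, "test": 3387}
REPORTED_BYTES = {
    "train": 1755770.239894889,
    "validation": 219479.38005255544,
    "test": 219479.38005255544,
}

total_examples = sum(SPLIT_COUNTS.values())

# Hypothesis: each split's size is the total size scaled by its share of examples.
prorated = {name: TOTAL_BYTES * count / total_examples for name, count in SPLIT_COUNTS.items()}
matches = all(
    math.isclose(prorated[name], REPORTED_BYTES[name], rel_tol=1e-6) for name in SPLIT_COUNTS
)

# Ratio of the listed total size to the listed download size.
size_ratio = TOTAL_BYTES / DOWNLOAD_BYTES

print(f"prorated sizes match reported sizes: {matches}")
print(f"total size / download size: {size_ratio:.2f}")
```

If the check passes, the per-split byte counts carry no information beyond the total size and the example counts.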

Configuration

config_name: default
- Data file paths:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
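The data file paths in the configuration are glob patterns, one per split. As an illustration of how such patterns assign files to splits, the sketch below uses Python's standard `fnmatch` as a stand-in; the Hugging Face Hub resolves patterns with its own logic, and the shard file names here are hypothetical.

```python
from fnmatch import fnmatch
from typing import Optional

# Patterns copied from the configuration on this page.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches the path, or None if no pattern matches."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


# Hypothetical file names, for illustration only.
example_paths = [
    "data/train-00000-of-00001.parquet",
    "data/validation-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]
assigned = {path: split_for(path) for path in example_paths}
print(assigned)
```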
