procit001/final_dataset_surname_firstname

Name: procit001/final_dataset_surname_firstname
Creator: procit001
Published: 2024-07-10 12:47:19
License: No description available

Source: Hugging Face. Updated 2024-07-10; indexed 2024-07-22.

Download link:

https://hf-mirror.com/datasets/procit001/final_dataset_surname_firstname

Resource description:

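The dataset contains five main features: id, tokens, pos_tags, chunk_tags, and ner_tags. pos_tags, chunk_tags, and ner_tags are sequence features whose integer labels each map to a class name. The dataset is divided into three splits (train, test, and validation), each with a recorded byte size and example count. The download size and total dataset size are also provided.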

This dataset is primarily used for natural language processing tasks, including part-of-speech tagging, chunking, and named entity recognition. It contains five main features: id (string), tokens (sequence of strings), pos_tags (sequence of part-of-speech tags), chunk_tags (sequence of chunk tags), and ner_tags (sequence of named entity recognition tags). The part-of-speech tag set covers punctuation marks as well as word categories such as nouns and verbs. The chunk tags mark the beginning (B-) and inside (I-) of phrases. The named entity recognition tags mark the beginning and inside of entities such as persons, organizations, and locations. The dataset is divided into three splits (train, test, and validation), each with a recorded size in bytes and number of examples.

Provider:

procit001

Summary of original information

Dataset overview

Feature information

id: string.
tokens: sequence of strings.
pos_tags: sequence of class labels, with the following label names:
- 0: "
- 1: ''
- 2: #
- 3: $
- 4: (
- 5: )
- 6: ,
- 7: .
- 8: :
- 9: ``
- 10: CC
- 11: CD
- 12: DT
- 13: EX
- 14: FW
- 15: IN
- 16: JJ
- 17: JJR
- 18: JJS
- 19: LS
- 20: MD
- 21: NN
- 22: NNP
- 23: NNPS
- 24: NNS
- 25: NN|SYM
- 26: PDT
- 27: POS
- 28: PRP
- 29: PRP$
- 30: RB
- 31: RBR
- 32: RBS
- 33: RP
- 34: SYM
- 35: TO
- 36: UH
- 37: VB
- 38: VBD
- 39: VBG
- 40: VBN
- 41: VBP
- 42: VBZ
- 43: WDT
- 44: WP
- 45: WP$
- 46: WRB
chunk_tags: sequence of class labels, with the following label names:
- 0: O
- 1: B-ADJP
- 2: I-ADJP
- 3: B-ADVP
- 4: I-ADVP
- 5: B-CONJP
- 6: I-CONJP
- 7: B-INTJ
- 8: I-INTJ
- 9: B-LST
- 10: I-LST
- 11: B-NP
- 12: I-NP
- 13: B-PP
- 14: I-PP
- 15: B-PRT
- 16: I-PRT
- 17: B-SBAR
- 18: I-SBAR
- 19: B-UCP
- 20: I-UCP
- 21: B-VP
- 22: I-VP
ner_tags: sequence of class labels, with the following label names:
- 0: O
- 1: B-PER
- 2: I-PER
- 3: B-ORG
- 4: I-ORG
- 5: B-LOC
- 6: I-LOC
- 7: B-MISC
- 8: I-MISC
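
The ner_tags values are integer class ids that follow the B-/I-/O scheme listed above. A minimal sketch of turning one example's ids back into entity spans is shown below. The function name `decode_entities` and the sample sentence in the usage note are illustrative assumptions, not part of the dataset; only the label order is taken from the list above.

```python
from typing import List, Tuple

# Label names copied from the ner_tags list above (list index = class id).
NER_LABELS = [
    "O", "B-PER", "I-PER", "B-ORG", "I-ORG",
    "B-LOC", "I-LOC", "B-MISC", "I-MISC",
]


def decode_entities(tokens: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Group tokens into (entity_type, text) spans using BIO tags.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    entities: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag_id in zip(tokens, tag_ids):
        prefix, _, ent_type = NER_LABELS[tag_id].partition("-")
        # An I- tag continues the open span only when its type matches.
        if prefix == "I" and ent_type == current_type:
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        # B- (or a stray I-) opens a new span; O opens nothing.
        if prefix in ("B", "I"):
            current_type, current_tokens = ent_type, [token]
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities
```

For a made-up sentence such as `["John", "Smith", "works", "at", "Acme", "Corp"]` with ids `[1, 2, 0, 0, 3, 4]`, the helper returns one PER span and one ORG span. The same pattern should carry over to chunk_tags by swapping in that label list.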

Data splits

train: 372566 examples, 26940618 bytes.
test: 46573 examples, 3368756 bytes.
validation: 46572 examples, 3361059 bytes.
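
As a quick check on how the examples are distributed, the split proportions can be computed from the counts listed above. This is only arithmetic on the published numbers; the variable names are illustrative.

```python
# Example counts copied from the split listing above.
SPLIT_EXAMPLES = {"train": 372566, "test": 46573, "validation": 46572}

total = sum(SPLIT_EXAMPLES.values())
fractions = {name: count / total for name, count in SPLIT_EXAMPLES.items()}

print(f"total examples: {total}")
for name, fraction in fractions.items():
    # Rounded to one decimal place of a percent.
    print(f"{name}: {fraction:.1%}")
```

The counts work out to roughly an 80/10/10 division between train, test, and validation.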

Dataset size

Download size: 9865774 bytes
Dataset size: 33670433 bytes
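
The reported sizes can be cross-checked against the per-split byte counts. The sketch below uses only the numbers published on this page. The reading that the download is smaller because the hosted files are stored in a compressed format is an assumption, not something the page states.

```python
# Byte counts copied from the split and size listings on this page.
SPLIT_BYTES = {"train": 26940618, "test": 3368756, "validation": 3361059}
DATASET_SIZE = 33670433
DOWNLOAD_SIZE = 9865774

split_total = sum(SPLIT_BYTES.values())
ratio = DOWNLOAD_SIZE / DATASET_SIZE

# The three splits add up exactly to the reported dataset size.
print(split_total == DATASET_SIZE)
# Download size as a share of the full dataset size.
print(f"{ratio:.1%}")
```

The split sizes sum to the reported dataset size, and the download is a little under 30% of it.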

Configuration

config_name: default
data_files:
- train: data/train-*
- test: data/test-*
- validation: data/validation-*
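
A minimal sketch of loading this configuration with the Hugging Face `datasets` library follows. It assumes the library is installed, the repository is publicly reachable, and the default config resolves the three splits as listed above. The helper names `load_splits` and `ner_label_names` are hypothetical and exist only for illustration.

```python
# Repository id taken from the Name field at the top of this page.
REPO_ID = "procit001/final_dataset_surname_firstname"
# Split names taken from the data_files mapping above.
SPLITS = ("train", "test", "validation")


def load_splits(repo_id: str = REPO_ID):
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so this module can be read without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id)


def ner_label_names(dataset_dict, split: str = "train"):
    """Return the class names attached to the ner_tags feature.

    Assumes ner_tags is a sequence of ClassLabel values, as the
    feature listing on this page describes.
    """
    return dataset_dict[split].features["ner_tags"].feature.names
```

Calling `load_splits()` should return a mapping keyed by the three split names, and `ner_label_names(...)` should then give back the nine NER labels if the hosted metadata matches this page.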
