
procit001/final_dataset_surname_firstname

Source: Hugging Face. Updated 2024-07-10. Indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/procit001/final_dataset_surname_firstname
Overview:
This dataset is intended for natural language processing tasks, including part-of-speech tagging, chunking, and named entity recognition. It contains five features: id (string), tokens (sequence of strings), pos_tags (sequence of part-of-speech tags), chunk_tags (sequence of chunk tags), and ner_tags (sequence of named entity recognition tags). The three tag features are sequences of class labels, each with a named category. The part-of-speech labels cover punctuation marks as well as word classes such as nouns and verbs. The chunk tags mark the beginning (B-) and inside (I-) of phrases, and the named entity recognition tags mark the beginning and inside of entities such as persons, organizations, and locations. The dataset is divided into three splits (train, test, and validation), each with a recorded byte size and number of examples; the download size and total dataset size are also provided.
Provider:
procit001
Original information summary

Dataset overview

Features

  • id: string.
  • tokens: sequence of strings.
  • pos_tags: sequence of class labels, with the following label IDs:
    • 0: "
    • 1: ''
    • 2: #
    • 3: $
    • 4: (
    • 5: )
    • 6: ,
    • 7: .
    • 8: :
    • 9: ``
    • 10: CC
    • 11: CD
    • 12: DT
    • 13: EX
    • 14: FW
    • 15: IN
    • 16: JJ
    • 17: JJR
    • 18: JJS
    • 19: LS
    • 20: MD
    • 21: NN
    • 22: NNP
    • 23: NNPS
    • 24: NNS
    • 25: NN|SYM
    • 26: PDT
    • 27: POS
    • 28: PRP
    • 29: PRP$
    • 30: RB
    • 31: RBR
    • 32: RBS
    • 33: RP
    • 34: SYM
    • 35: TO
    • 36: UH
    • 37: VB
    • 38: VBD
    • 39: VBG
    • 40: VBN
    • 41: VBP
    • 42: VBZ
    • 43: WDT
    • 44: WP
    • 45: WP$
    • 46: WRB
  • chunk_tags: sequence of class labels, with the following label IDs:
    • 0: O
    • 1: B-ADJP
    • 2: I-ADJP
    • 3: B-ADVP
    • 4: I-ADVP
    • 5: B-CONJP
    • 6: I-CONJP
    • 7: B-INTJ
    • 8: I-INTJ
    • 9: B-LST
    • 10: I-LST
    • 11: B-NP
    • 12: I-NP
    • 13: B-PP
    • 14: I-PP
    • 15: B-PRT
    • 16: I-PRT
    • 17: B-SBAR
    • 18: I-SBAR
    • 19: B-UCP
    • 20: I-UCP
    • 21: B-VP
    • 22: I-VP
  • ner_tags: sequence of class labels, with the following label IDs:
    • 0: O
    • 1: B-PER
    • 2: I-PER
    • 3: B-ORG
    • 4: I-ORG
    • 5: B-LOC
    • 6: I-LOC
    • 7: B-MISC
    • 8: I-MISC
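
The integer IDs above are what each example stores; turning them back into label names and grouping B-/I- tags into entity spans can be sketched as follows. This is a minimal illustration, not the provider's code: the helper names (decode_tags, extract_spans) and the example sentence are hypothetical, and only the NER label list is taken from the listing above.

```python
from typing import List, Tuple

# Label names copied from the ner_tags list above; the list index is the label ID.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_tags(tag_ids: List[int], labels: List[str]) -> List[str]:
    """Map integer tag IDs to their label names."""
    return [labels[i] for i in tag_ids]


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "B" or (prefix == "I" and kind != current_type):
            # A B- tag opens a new span; an I- tag without a matching
            # open span is treated as a span start as well.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = kind, [token]
        elif prefix == "I":
            # Continuation of the currently open span.
            current_tokens.append(token)
        else:
            # "O" closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence (not taken from the dataset).
example_tokens = ["Jan", "Jansen", "works", "at", "Acme", "in", "Utrecht", "."]
example_ids = [1, 2, 0, 0, 3, 0, 5, 0]

example_tags = decode_tags(example_ids, NER_LABELS)
example_spans = extract_spans(example_tokens, example_tags)
print(example_spans)
```

The same two helpers apply to chunk_tags, since those labels use the same B-/I-/O scheme; only the label list passed to decode_tags changes.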

Data splits

  • train: 372,566 examples, 26,940,618 bytes.
  • test: 46,573 examples, 3,368,756 bytes.
  • validation: 46,572 examples, 3,361,059 bytes.

Dataset size

  • Download size: 9,865,774 bytes
  • Dataset size: 33,670,433 bytes
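
The split and size figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed above; the variable names are illustrative. It shows that the three split sizes add up to the reported dataset size, that the splits are close to an 80/10/10 division, and that the download is roughly 29% of the dataset size.

```python
# Figures copied from the split and size listings above.
SPLITS = {
    "train": {"examples": 372566, "bytes": 26940618},
    "test": {"examples": 46573, "bytes": 3368756},
    "validation": {"examples": 46572, "bytes": 3361059},
}
DOWNLOAD_SIZE = 9865774
DATASET_SIZE = 33670433

total_examples = sum(s["examples"] for s in SPLITS.values())
total_bytes = sum(s["bytes"] for s in SPLITS.values())

# Share of all examples that falls into each split.
fractions = {name: s["examples"] / total_examples for name, s in SPLITS.items()}

# Ratio of the download size to the reported dataset size.
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"total examples: {total_examples}")
print(f"split bytes sum to dataset size: {total_bytes == DATASET_SIZE}")
for name, share in fractions.items():
    print(f"{name}: {share:.1%}")
print(f"download / dataset size: {download_ratio:.1%}")
```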

Configuration

  • config_name: default
  • data_files:
    • train: data/train-*
    • test: data/test-*
    • validation: data/validation-*
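
Given this configuration, one plausible way to load a split is through the Hugging Face datasets library. The sketch below assumes that the library is installed, that the repository is publicly reachable under the ID shown in the download link, and that the default configuration is picked up automatically; the load_split helper is a hypothetical wrapper, not part of any library.

```python
# Repository ID taken from the download link on this page.
REPO_ID = "procit001/final_dataset_surname_firstname"

# Split-to-file patterns copied from the configuration listing above.
DATA_FILES = {
    "train": "data/train-*",
    "test": "data/test-*",
    "validation": "data/validation-*",
}


def load_split(split: str = "train"):
    """Load one split of the dataset (hypothetical helper; needs network access)."""
    if split not in DATA_FILES:
        raise ValueError(f"unknown split {split!r}; expected one of {sorted(DATA_FILES)}")
    # Imported lazily so the constants above are usable without the library.
    from datasets import load_dataset
    return load_dataset(REPO_ID, split=split)


# Example call, left commented out because it downloads data:
# train = load_split("train")
# print(train[0]["tokens"])
```

If the tag columns are stored as sequences of class labels, as the feature listing suggests, the label names should be available from the loaded split's features (for example via train.features["ner_tags"].feature.names), though this has not been verified against this particular repository.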