procit002/lastNameListDataset_2
Hugging Face | Updated 2024-06-18 | Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/procit002/lastNameListDataset_2
Summary:
This dataset contains tokenized text data with annotations for part-of-speech tags (POS tags), chunk tags, and named entity recognition tags (NER tags). The dataset is divided into training, validation, and test sets, containing 20157, 2520, and 2520 examples respectively. The download size of the dataset is 473730 bytes, and the total size is 1536001 bytes.
Provider:
procit002
Summary of original information
Dataset overview
Dataset features
- id: unique identifier of the example; data type is string.
- tokens: the text as a token sequence; data type is a sequence of strings.
- pos_tags: sequence of part-of-speech tags, with the following classes:
  - 0: "
  - 1: ''
  - 2: #
  - 3: $
  - 4: (
  - 5: )
  - 6: ,
  - 7: .
  - 8: :
  - 9: ``
  - 10: CC
  - 11: CD
  - 12: DT
  - 13: EX
  - 14: FW
  - 15: IN
  - 16: JJ
  - 17: JJR
  - 18: JJS
  - 19: LS
  - 20: MD
  - 21: NN
  - 22: NNP
  - 23: NNPS
  - 24: NNS
  - 25: NN|SYM
  - 26: PDT
  - 27: POS
  - 28: PRP
  - 29: PRP$
  - 30: RB
  - 31: RBR
  - 32: RBS
  - 33: RP
  - 34: SYM
  - 35: TO
  - 36: UH
  - 37: VB
  - 38: VBD
  - 39: VBG
  - 40: VBN
  - 41: VBP
  - 42: VBZ
  - 43: WDT
  - 44: WP
  - 45: WP$
  - 46: WRB
- chunk_tags: sequence of phrase-chunk tags, with the following classes:
  - 0: O
  - 1: B-ADJP
  - 2: I-ADJP
  - 3: B-ADVP
  - 4: I-ADVP
  - 5: B-CONJP
  - 6: I-CONJP
  - 7: B-INTJ
  - 8: I-INTJ
  - 9: B-LST
  - 10: I-LST
  - 11: B-NP
  - 12: I-NP
  - 13: B-PP
  - 14: I-PP
  - 15: B-PRT
  - 16: I-PRT
  - 17: B-SBAR
  - 18: I-SBAR
  - 19: B-UCP
  - 20: I-UCP
  - 21: B-VP
  - 22: I-VP
- ner_tags: sequence of named entity recognition tags, with the following classes:
  - 0: O
  - 1: B-PER
  - 2: I-PER
  - 3: B-ORG
  - 4: I-ORG
  - 5: B-LOC
  - 6: I-LOC
  - 7: B-MISC
  - 8: I-MISC
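As a rough illustration of how the integer ids in the feature lists map back to label strings, the sketch below decodes an `ner_tags` sequence and groups B-/I- tags into entity spans. The label order is copied from the `ner_tags` list; the helper names `decode_tags` and `extract_entities` are hypothetical, and the example tokens and tags are made up for illustration rather than taken from the dataset.

```python
from typing import List, Optional, Tuple

# Label order copied from the ner_tags list (list index == class id).
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_tags(tag_ids: List[int], labels: List[str] = NER_LABELS) -> List[str]:
    """Map integer class ids to their label strings."""
    return [labels[i] for i in tag_ids]


def extract_entities(tokens: List[str], tag_ids: List[int]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, label in zip(tokens, decode_tags(tag_ids)):
        starts_entity = label.startswith("B-") or (
            # An I- tag that does not continue the open entity is treated as a new start.
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_entity:
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-"):
            current_tokens.append(token)
        else:  # "O" closes any open entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Made-up example, not a record from the dataset.
tokens = ["Anna", "de", "Vries", "works", "in", "Utrecht", "."]
tags = [1, 2, 2, 0, 0, 5, 0]
print(extract_entities(tokens, tags))
# prints [('Anna de Vries', 'PER'), ('Utrecht', 'LOC')]
```

The same `decode_tags` helper would work for `pos_tags` or `chunk_tags` if given the corresponding label list in index order.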
Dataset splits
- train: training set, 20157 examples, 1228749 bytes.
- validation: validation set, 2520 examples, 153557 bytes.
- test: test set, 2520 examples, 153695 bytes.
Dataset size
- Download size: 473730 bytes
- Total size: 1536001 bytes
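The split and size figures can be cross-checked with a little arithmetic. The sketch below uses only the numbers listed here; it shows that the per-split byte counts add up to the stated total size and that the examples are divided roughly 80/10/10 across train, validation, and test.

```python
# Figures copied from the split and size lists.
SPLITS = {
    "train": {"examples": 20157, "bytes": 1228749},
    "validation": {"examples": 2520, "bytes": 153557},
    "test": {"examples": 2520, "bytes": 153695},
}
TOTAL_BYTES = 1536001
DOWNLOAD_BYTES = 473730

total_examples = sum(s["examples"] for s in SPLITS.values())
total_bytes = sum(s["bytes"] for s in SPLITS.values())

# Share of examples per split.
for name, s in SPLITS.items():
    share = 100 * s["examples"] / total_examples
    print(f"{name}: {s['examples']} examples ({share:.1f}%)")

# The per-split byte counts should add up to the stated total size.
print("total examples:", total_examples)
print("split bytes match total size:", total_bytes == TOTAL_BYTES)
print(f"download size / total size: {DOWNLOAD_BYTES / TOTAL_BYTES:.2f}")
```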
Configuration
- config_name: default
- data_files:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
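To make the `data_files` patterns concrete, the following sketch matches repository paths against the three glob patterns using Python's standard `fnmatch` module. The shard file names are hypothetical and only illustrate the matching; the actual file names in the repository are not listed here. The `load_all_splits` function assumes the third-party Hugging Face `datasets` library is installed and that the repository is reachable; it is defined but not called.

```python
from fnmatch import fnmatch
from typing import Optional

# Glob patterns copied from the default config.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches the repository path, if any."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_all_splits():
    """Load every split with the Hugging Face `datasets` library (needs network access)."""
    from datasets import load_dataset  # third-party dependency, imported lazily

    return load_dataset("procit002/lastNameListDataset_2")


# Hypothetical shard names, used only to illustrate the pattern matching.
for name in [
    "data/train-00000-of-00001.parquet",
    "data/validation-00000-of-00001.parquet",
    "README.md",
]:
    print(name, "->", split_for(name))
```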