procit001/dutch_surname_dataset
Source: Hugging Face. Updated 2024-07-10; indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/procit001/dutch_surname_dataset
Resource summary:
Each record in the dataset has five features: id (a string), tokens (a sequence of strings), pos_tags (a sequence of part-of-speech tags), chunk_tags (a sequence of chunk tags), and ner_tags (a sequence of named entity recognition tags). The pos_tags labels include CC (coordinating conjunction), CD (cardinal number), and DT (determiner), among others. The chunk_tags labels include B-ADJP (beginning of an adjective phrase) and I-ADJP (inside an adjective phrase), among others. The ner_tags labels include B-PER (beginning of a person name) and I-PER (inside a person name), among others. The dataset is divided into train, validation, and test sets containing 29317, 3665, and 3665 examples respectively. The download size is 732567 bytes and the total dataset size is 2572282 bytes.
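The record layout described above can be checked with a small validator. This is a minimal sketch, not part of the dataset's tooling: the function name `validate_record` and the example record are hypothetical, the class counts (47, 23, 9) are counted from the label lists in this card, and the check that every tag sequence has one tag per token is an assumption based on the usual CoNLL-style layout.

```python
from typing import Dict, List

# Number of classes per tag feature, counted from the label lists in this card.
NUM_CLASSES: Dict[str, int] = {"pos_tags": 47, "chunk_tags": 23, "ner_tags": 9}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none are found)."""
    problems: List[str] = []
    if not isinstance(record.get("id"), str):
        problems.append("id must be a string")
    tokens = record.get("tokens", [])
    if not all(isinstance(t, str) for t in tokens):
        problems.append("tokens must be strings")
    for field, n_classes in NUM_CLASSES.items():
        tags = record.get(field, [])
        # Assumption: each tag sequence is aligned one-to-one with tokens.
        if len(tags) != len(tokens):
            problems.append(f"{field} has {len(tags)} tags for {len(tokens)} tokens")
        if any(not 0 <= t < n_classes for t in tags):
            problems.append(f"{field} has an id outside 0..{n_classes - 1}")
    return problems


# Hypothetical record, invented for illustration (not taken from the dataset).
example = {
    "id": "0",
    "tokens": ["Jan", "de", "Vries"],
    "pos_tags": [22, 22, 22],     # NNP NNP NNP
    "chunk_tags": [11, 12, 12],   # B-NP I-NP I-NP
    "ner_tags": [1, 2, 2],        # B-PER I-PER I-PER
}
print(validate_record(example))
```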
Provider:
procit001
Original information summary
Dataset overview
Features
- id: string.
- tokens: sequence of strings.
- pos_tags: sequence of class labels, with the following id-to-label mapping:
  - 0: "
  - 1: ''
  - 2: #
  - 3: $
  - 4: (
  - 5: )
  - 6: ,
  - 7: .
  - 8: :
  - 9: ``
  - 10: CC
  - 11: CD
  - 12: DT
  - 13: EX
  - 14: FW
  - 15: IN
  - 16: JJ
  - 17: JJR
  - 18: JJS
  - 19: LS
  - 20: MD
  - 21: NN
  - 22: NNP
  - 23: NNPS
  - 24: NNS
  - 25: NN|SYM
  - 26: PDT
  - 27: POS
  - 28: PRP
  - 29: PRP$
  - 30: RB
  - 31: RBR
  - 32: RBS
  - 33: RP
  - 34: SYM
  - 35: TO
  - 36: UH
  - 37: VB
  - 38: VBD
  - 39: VBG
  - 40: VBN
  - 41: VBP
  - 42: VBZ
  - 43: WDT
  - 44: WP
  - 45: WP$
  - 46: WRB
- chunk_tags: sequence of class labels, with the following id-to-label mapping:
  - 0: O
  - 1: B-ADJP
  - 2: I-ADJP
  - 3: B-ADVP
  - 4: I-ADVP
  - 5: B-CONJP
  - 6: I-CONJP
  - 7: B-INTJ
  - 8: I-INTJ
  - 9: B-LST
  - 10: I-LST
  - 11: B-NP
  - 12: I-NP
  - 13: B-PP
  - 14: I-PP
  - 15: B-PRT
  - 16: I-PRT
  - 17: B-SBAR
  - 18: I-SBAR
  - 19: B-UCP
  - 20: I-UCP
  - 21: B-VP
  - 22: I-VP
- ner_tags: sequence of class labels, with the following id-to-label mapping:
  - 0: O
  - 1: B-PER
  - 2: I-PER
  - 3: B-ORG
  - 4: I-ORG
  - 5: B-LOC
  - 6: I-LOC
  - 7: B-MISC
  - 8: I-MISC
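Because ner_tags stores integer ids in a B-/I-/O scheme, turning a record back into named entities takes two steps: map each id to its label, then merge a B- tag with the I- tags of the same type that follow it. The sketch below illustrates one way to do this. The label order is copied from the ner_tags list above; the function name `extract_entities` and the example sentence are hypothetical, and treating a stray I- tag as the start of a new span is a design choice of this sketch rather than documented dataset behavior.

```python
from typing import List, Optional, Tuple

# Label order copied from the ner_tags list above (index = class id).
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def extract_entities(tokens: List[str], ner_tags: List[int]) -> List[Tuple[str, str]]:
    """Group tokens into (entity_type, text) spans using BIO tags."""
    entities: List[Tuple[str, str]] = []
    current: Optional[Tuple[str, List[str]]] = None  # (type, tokens so far)

    for token, tag_id in zip(tokens, ner_tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, ent_type = NER_LABELS[tag_id].partition("-")
        if prefix == "I" and current is not None and current[0] == ent_type:
            current[1].append(token)  # continue the open span
            continue
        if current is not None:
            entities.append((current[0], " ".join(current[1])))
            current = None
        if prefix in ("B", "I"):  # a stray I- tag starts a new span here
            current = (ent_type, [token])

    if current is not None:  # close a span that runs to the end of the sentence
        entities.append((current[0], " ".join(current[1])))
    return entities


# Hypothetical sentence, invented for illustration.
tokens = ["Jan", "de", "Vries", "woont", "in", "Utrecht", "."]
tags = [1, 2, 2, 0, 0, 5, 0]
print(extract_entities(tokens, tags))
# → [('PER', 'Jan de Vries'), ('LOC', 'Utrecht')]
```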
Dataset splits
- train: 29317 examples, 2057162 bytes.
- validation: 3665 examples, 257687 bytes.
- test: 3665 examples, 257433 bytes.
Dataset size
- Download size: 732567 bytes
- Total dataset size: 2572282 bytes
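The figures above are internally consistent: the three split sizes add up to the stated total of 2572282 bytes, and the example counts work out to roughly an 80/10/10 split. A short sketch of that arithmetic, using only the numbers listed above (the variable names are illustrative):

```python
# Split statistics copied from the lists above.
splits = {
    "train": {"examples": 29317, "bytes": 2057162},
    "validation": {"examples": 3665, "bytes": 257687},
    "test": {"examples": 3665, "bytes": 257433},
}
download_bytes = 732567

total_examples = sum(s["examples"] for s in splits.values())
total_bytes = sum(s["bytes"] for s in splits.values())

for name, s in splits.items():
    share = s["examples"] / total_examples
    print(f"{name}: {share:.1%} of examples")

print("total examples:", total_examples)   # 36647
print("total bytes:", total_bytes)         # 2572282, matching the stated total
print(f"download/total size ratio: {download_bytes / total_bytes:.2f}")
```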
Configuration
- config_name: default
- Data file paths:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
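The data file paths above are glob patterns: any file whose path matches `data/train-*` belongs to the train split, and so on. The sketch below illustrates that mapping with Python's standard `fnmatch` module. It is only an approximation of how such patterns behave, not the Hugging Face resolution logic; the helper `split_for` and the shard file names are hypothetical, since the card does not list the actual files.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the configuration above.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for(path: str) -> Optional[str]:
    """Return the split whose glob pattern matches `path`, or None."""
    for split, pattern in DATA_FILES.items():
        if fnmatchcase(path, pattern):
            return split
    return None


# Hypothetical shard names, for illustration only.
for path in ["data/train-00000-of-00001.parquet",
             "data/validation-00000-of-00001.parquet",
             "README.md"]:
    print(path, "->", split_for(path))
```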



