procit001/dutch_surname_dataset

Name: procit001/dutch_surname_dataset
Creator: procit001
Published: 2024-07-10 12:43:34
License: not specified

Source: Hugging Face, updated 2024-07-10, indexed 2024-07-22

Download link:

https://hf-mirror.com/datasets/procit001/dutch_surname_dataset

Summary:

Each example in the dataset has five fields: id (a string), tokens (a sequence of strings), pos_tags (a sequence of part-of-speech tags), chunk_tags (a sequence of chunk tags), and ner_tags (a sequence of named entity recognition tags). The part-of-speech labels include CC (coordinating conjunction), CD (cardinal number), and DT (determiner), among others. The chunk labels include B-ADJP (beginning of an adjective phrase) and I-ADJP (inside an adjective phrase), among others. The named entity labels include B-PER (beginning of a person name) and I-PER (inside a person name), among others. The dataset is divided into train, validation, and test sets containing 29317, 3665, and 3665 examples respectively. The download size is 732567 bytes, and the total dataset size is 2572282 bytes.

Provider:

procit001

Original information summary

Dataset overview

Features

id: string.
tokens: sequence of strings.
pos_tags: sequence of class labels, with the following id-to-label mapping:
- 0: "
- 1: ''
- 2: #
- 3: $
- 4: (
- 5: )
- 6: ,
- 7: .
- 8: :
- 9: ``
- 10: CC
- 11: CD
- 12: DT
- 13: EX
- 14: FW
- 15: IN
- 16: JJ
- 17: JJR
- 18: JJS
- 19: LS
- 20: MD
- 21: NN
- 22: NNP
- 23: NNPS
- 24: NNS
- 25: NN|SYM
- 26: PDT
- 27: POS
- 28: PRP
- 29: PRP$
- 30: RB
- 31: RBR
- 32: RBS
- 33: RP
- 34: SYM
- 35: TO
- 36: UH
- 37: VB
- 38: VBD
- 39: VBG
- 40: VBN
- 41: VBP
- 42: VBZ
- 43: WDT
- 44: WP
- 45: WP$
- 46: WRB
chunk_tags: sequence of class labels, with the following id-to-label mapping:
- 0: O
- 1: B-ADJP
- 2: I-ADJP
- 3: B-ADVP
- 4: I-ADVP
- 5: B-CONJP
- 6: I-CONJP
- 7: B-INTJ
- 8: I-INTJ
- 9: B-LST
- 10: I-LST
- 11: B-NP
- 12: I-NP
- 13: B-PP
- 14: I-PP
- 15: B-PRT
- 16: I-PRT
- 17: B-SBAR
- 18: I-SBAR
- 19: B-UCP
- 20: I-UCP
- 21: B-VP
- 22: I-VP
ner_tags: sequence of class labels, with the following id-to-label mapping:
- 0: O
- 1: B-PER
- 2: I-PER
- 3: B-ORG
- 4: I-ORG
- 5: B-LOC
- 6: I-LOC
- 7: B-MISC
- 8: I-MISC
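
The integer ids in ner_tags only become readable once they are mapped back to the labels listed above. The following is a minimal sketch of that decoding, plus one common way of grouping B-/I- tags into entity spans. The function names `decode` and `bio_spans` are hypothetical helpers, and the example sentence is invented for illustration; it is not taken from the dataset.

```python
from typing import List, Sequence, Tuple

# Label list copied from the ner_tags listing above, in id order.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode(tag_ids: Sequence[int], labels: Sequence[str]) -> List[str]:
    """Map integer class ids to their string labels."""
    return [labels[i] for i in tag_ids]


def bio_spans(tokens: Sequence[str], tag_ids: Sequence[int],
              labels: Sequence[str] = NER_LABELS) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) pairs.

    Assumption: an I- tag whose type differs from the open span
    (or that appears with no open span) starts a new span.
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, decode(tag_ids, labels)):
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and kind == current_type:
            # Continue the span that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            current_type, current_tokens = kind, [token]
        else:  # "O": outside any entity
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Invented example sentence, not drawn from the dataset.
example_tokens = ["Jan", "de", "Vries", "woont", "in", "Utrecht", "."]
example_tags = [1, 2, 2, 0, 0, 5, 0]
print(bio_spans(example_tokens, example_tags))
```

The same `decode` helper works for pos_tags and chunk_tags if it is given the corresponding label list in id order.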

Dataset splits

train: 29317 examples, 2057162 bytes.
validation: 3665 examples, 257687 bytes.
test: 3665 examples, 257433 bytes.
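
As a quick check on how the examples are distributed, the split counts above can be turned into proportions. This is only arithmetic on the listed figures; the variable names are illustrative.

```python
# Example counts per split, copied from the listing above.
SPLIT_EXAMPLES = {"train": 29317, "validation": 3665, "test": 3665}

total_examples = sum(SPLIT_EXAMPLES.values())
fractions = {name: count / total_examples
             for name, count in SPLIT_EXAMPLES.items()}

print(f"total: {total_examples}")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

The counts work out to roughly an 80/10/10 split across train, validation, and test.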

Dataset size

Download size: 732567 bytes
Total dataset size: 2572282 bytes
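
The size figures can be cross-checked against the per-split byte sizes given in the splits listing. The sketch below confirms that the three split sizes add up to the total dataset size and computes how the download size compares with it; the interpretation of the ratio as a compression effect of the stored files is an assumption, not something the listing states.

```python
# Byte sizes copied from the listing.
SPLIT_BYTES = {"train": 2057162, "validation": 257687, "test": 257433}
DOWNLOAD_BYTES = 732567
DATASET_BYTES = 2572282

# The per-split sizes should account for the whole dataset size.
splits_total = sum(SPLIT_BYTES.values())
print(splits_total == DATASET_BYTES)

# Download size relative to the total dataset size.
ratio = DOWNLOAD_BYTES / DATASET_BYTES
print(f"download / total: {ratio:.2%}")
```

The download is a bit under a third of the total dataset size.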

Configuration

config_name: default
- Data file paths:
  - train: data/train-*
  - validation: data/validation-*
  - test: data/test-*
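
The configuration maps each split to a glob pattern over files in the repository. The sketch below illustrates that mapping with the standard-library `fnmatch`, which only approximates how the Hub resolves patterns, and shows one way the dataset might be loaded with the Hugging Face `datasets` library. The helper names and the sample file names are hypothetical; `load_splits` assumes `datasets` is installed and that the repository is still reachable.

```python
from fnmatch import fnmatch
from typing import Optional

# Patterns copied from the default config above.
DATA_FILES = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def split_for_path(path: str) -> Optional[str]:
    """Return the split whose pattern matches the given file path."""
    for split, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return split
    return None


def load_splits():
    """Load all splits from the Hub (needs network access and `datasets`)."""
    # Imported lazily so the helper above works without the dependency.
    from datasets import load_dataset
    return load_dataset("procit001/dutch_surname_dataset")


# Hypothetical file names, used only to exercise the patterns.
print(split_for_path("data/train-00000-of-00001.parquet"))
print(split_for_path("README.md"))
```

Calling `load_splits()` would be expected to return an object keyed by split name, for example `load_splits()["train"]`, provided the repository layout still matches this configuration.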
