kinianlo/wikipedia_pos_tagged
Source: Hugging Face. Updated 2024-04-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/kinianlo/wikipedia_pos_tagged
Description:
This dataset is a Part-of-Speech (POS) tagged version of the Wikipedia dataset. It includes three distinct variants: 1. nltk variant: tagged using the nltk POS tagger; 2. spacy variant: tagged using the en_core_web_sm POS tagger; 3. simple variant: sourced from Simple English Wikipedia.
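Each variant is exposed as a named configuration such as `20220301_simple_spacy`. A minimal sketch of composing such a name and passing it to the Hugging Face `datasets` library follows. The helpers `config_name` and `load_variant` are hypothetical, and the repository id is taken from the download link on the assumption that it matches the id on the Hugging Face Hub.

```python
REPO_ID = "kinianlo/wikipedia_pos_tagged"  # assumed Hub id, taken from the download link

# (wiki, tagger) pairs that appear as configurations on this card;
# each one also has a "_tags_only" counterpart.
KNOWN_VARIANTS = {("en", "nltk"), ("simple", "nltk"), ("simple", "spacy")}


def config_name(wiki: str, tagger: str, tags_only: bool = False,
                dump: str = "20220301") -> str:
    """Compose a configuration name like '20220301_simple_spacy_tags_only'."""
    if (wiki, tagger) not in KNOWN_VARIANTS:
        raise ValueError(f"no configuration listed for wiki={wiki!r}, tagger={tagger!r}")
    name = f"{dump}_{wiki}_{tagger}"
    return name + "_tags_only" if tags_only else name


def load_variant(wiki: str, tagger: str, tags_only: bool = False,
                 streaming: bool = True):
    """Load the train split of one configuration.

    Needs `pip install datasets` and network access; streaming avoids
    downloading the full configuration up front.
    """
    from datasets import load_dataset  # lazy import so config_name works without it

    return load_dataset(REPO_ID, config_name(wiki, tagger, tags_only),
                        split="train", streaming=streaming)


print(config_name("simple", "spacy", tags_only=True))
# prints: 20220301_simple_spacy_tags_only
```

Streaming is chosen as the default here because the English configurations are tens of gigabytes uncompressed; pass `streaming=False` for the smaller Simple English variants if random access is needed.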
Provider:
kinianlo
Summary of original information
Dataset overview
Dataset configurations
20220301_en_nltk
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
  - pos_tags: sequence of sequence of string
- Splits:
  - train:
    - Size (bytes): 88585221192
    - Examples: 6458670
- Download size (bytes): 3527644902
- Dataset size (bytes): 88585221192
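The `pos_tags` feature is declared as a sequence of sequences of strings. The card does not say what the two nesting levels represent; one plausible reading, assumed here, is one inner list of tags per sentence. Under that assumption, a sketch of counting tag frequencies in a single record could look like this. The record below is made up for illustration and is not taken from the dataset.

```python
from collections import Counter
from itertools import chain

# Hypothetical record shaped like the 20220301_en_nltk features; all values are invented.
record = {
    "id": "0",
    "url": "https://example.org/wiki/Example",
    "title": "Example",
    "text": "The cat sat. It purred.",
    # Assumed layout: outer list = sentences, inner list = one tag per token.
    "pos_tags": [["DT", "NN", "VBD", "."], ["PRP", "VBD", "."]],
}


def tag_counts(pos_tags):
    """Flatten the nested tag lists and count how often each tag occurs."""
    return Counter(chain.from_iterable(pos_tags))


counts = tag_counts(record["pos_tags"])
print(counts["VBD"], sum(counts.values()))
```

Because the function only flattens one level, it works for any record whose `pos_tags` is a list of lists of strings, whatever the outer level turns out to mean.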
20220301_en_nltk_tags_only
- Features:
  - id: string
  - url: string
  - title: string
  - pos_tags: sequence of sequence of string
- Splits:
  - train:
    - Size (bytes): 68920385173
    - Examples: 6458670
- Download size (bytes): 0
- Dataset size (bytes): 68920385173
20220301_simple_nltk
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
  - pos_tags: sequence of sequence of string
- Splits:
  - train:
    - Size (bytes): 1000903680
    - Examples: 205328
- Download size (bytes): 286763992
- Dataset size (bytes): 1000903680
20220301_simple_nltk_tags_only
- Features:
  - id: string
  - url: string
  - title: string
  - pos_tags: sequence of sequence of string
- Splits:
  - train:
    - Size (bytes): 783729741
    - Examples: 205328
- Download size (bytes): 161414334
- Dataset size (bytes): 783729741
20220301_simple_spacy
- Features:
  - id: string
  - url: string
  - title: string
  - text: string
  - pos_tags: sequence of sequence of string
- Splits:
  - train:
    - Size (bytes): 1131814443
    - Examples: 205328
- Download size (bytes): 289479815
- Dataset size (bytes): 1131814443
20220301_simple_spacy_tags_only
- Features:
  - id: string
  - url: string
  - title: string
  - pos_tags: sequence of sequence of string
- Splits:
  - train:
    - Size (bytes): 914640504
    - Examples: 205328
- Download size (bytes): 164284823
- Dataset size (bytes): 914640504
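The figures listed for the six configurations allow a few quick sanity checks, such as the average uncompressed bytes per example and how much of a full configuration's size remains in its tags_only counterpart. The sketch below copies the dataset sizes and example counts from the listing; the dictionary layout and helper names are only for illustration.

```python
# (dataset size in bytes, number of train examples), copied from the listing
SIZES = {
    "20220301_en_nltk": (88585221192, 6458670),
    "20220301_en_nltk_tags_only": (68920385173, 6458670),
    "20220301_simple_nltk": (1000903680, 205328),
    "20220301_simple_nltk_tags_only": (783729741, 205328),
    "20220301_simple_spacy": (1131814443, 205328),
    "20220301_simple_spacy_tags_only": (914640504, 205328),
}


def bytes_per_example(config: str) -> float:
    """Average uncompressed size of one example in the given configuration."""
    size, n_examples = SIZES[config]
    return size / n_examples


def tags_only_fraction(base: str) -> float:
    """Share of the full configuration's bytes occupied by its tags_only twin."""
    return SIZES[base + "_tags_only"][0] / SIZES[base][0]


def text_bytes(base: str) -> int:
    """Bytes that disappear when going from the full configuration to tags_only."""
    return SIZES[base][0] - SIZES[base + "_tags_only"][0]


for base in ("20220301_en_nltk", "20220301_simple_nltk", "20220301_simple_spacy"):
    print(f"{base}: {bytes_per_example(base):,.0f} B/example, "
          f"tags_only keeps {tags_only_fraction(base):.0%} of the bytes")
```

One observation from these numbers: the gap between the full and tags_only sizes is identical for `20220301_simple_nltk` and `20220301_simple_spacy`, which is consistent with the gap being the `text` column that both share.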
Data file paths
- 20220301_en_nltk:
  - train: 20220301_en_nltk/train-*
- 20220301_en_nltk_tags_only:
  - train: 20220301_en_nltk_tags_only/train-*
- 20220301_simple_nltk:
  - train: 20220301_simple_nltk/train-*
- 20220301_simple_nltk_tags_only:
  - train: 20220301_simple_nltk_tags_only/train-*
- 20220301_simple_spacy:
  - train: 20220301_simple_spacy/train-*
- 20220301_simple_spacy_tags_only:
  - train: 20220301_simple_spacy_tags_only/train-*
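
Each path is a glob pattern: `train-*` matches every shard file of the train split inside the configuration's directory. A minimal sketch of that matching with Python's `fnmatch` module follows. The shard filenames are hypothetical, since the card does not list the actual files, and `data_files` and `shards` are illustrative helper names.

```python
from fnmatch import fnmatchcase


def data_files(config: str) -> dict:
    """Split-to-pattern mapping in the shape used by the listing above."""
    return {"train": f"{config}/train-*"}


def shards(config: str, files: list) -> list:
    """Return the files whose path matches the train pattern of `config`."""
    pattern = data_files(config)["train"]
    return [f for f in files if fnmatchcase(f, pattern)]


# Hypothetical repository listing; real shard names and counts may differ.
files = [
    "20220301_simple_nltk/train-00000-of-00002.parquet",
    "20220301_simple_nltk/train-00001-of-00002.parquet",
    "20220301_simple_nltk_tags_only/train-00000-of-00001.parquet",
    "README.md",
]

print(shards("20220301_simple_nltk", files))
```

Because the pattern includes the directory name followed by `/`, the `20220301_simple_nltk` pattern does not pick up shards from `20220301_simple_nltk_tags_only`, even though one name is a prefix of the other.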



