yuji96/enwiki_tagged
Source: Hugging Face. Updated: 2024-09-04. Indexed: 2024-12-14.
Download link:
https://hf-mirror.com/datasets/yuji96/enwiki_tagged
Description:
The enwiki_tagged dataset is derived from the text of [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) (20231101.en), with part-of-speech tags applied to the full text. Tokenization and POS tagging were performed with nltk.word_tokenize and nltk.pos_tag, respectively. Features include the original text, the tokens, the part-of-speech tags, ID, URL, and title. The dataset has only a training split, which contains 6,407,814 examples and is 74,491,455,326 bytes in size.
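The tagging step described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's actual script: the helper names `tag_text` and `tag_counts`, the field names `tokens` and `pos_tags`, and the sample record are all assumptions made for the example. Only `nltk.word_tokenize` and `nltk.pos_tag` come from the description.

```python
from collections import Counter


def tag_text(text):
    """Tokenize and POS-tag one article body with NLTK.

    Assumes nltk is installed and that its tokenizer and tagger resources
    were already fetched with nltk.download (resource names vary by version).
    """
    import nltk  # imported lazily so the helper below works without NLTK

    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)  # list of (token, tag) pairs
    tags = [tag for _, tag in tagged]
    return tokens, tags


def tag_counts(tags):
    """Count how often each POS tag occurs in one record."""
    return Counter(tags)


# Hand-written illustrative record, NOT taken from the dataset.
# The field names "tokens" and "pos_tags" are hypothetical.
example = {
    "tokens": ["Wikipedia", "is", "a", "free", "encyclopedia", "."],
    "pos_tags": ["NNP", "VBZ", "DT", "JJ", "NN", "."],
}

# Tokens and tags are parallel lists, so zip pairs them up position by position.
pairs = list(zip(example["tokens"], example["pos_tags"]))
counts = tag_counts(example["pos_tags"])
```

Storing tokens and tags as two parallel lists, as assumed here, keeps each record flat and lets the pairs be rebuilt with `zip` when needed. By default `nltk.pos_tag` emits Penn Treebank tags for English.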
Provided by:
yuji96



