Bas95/wikitext

Name: Bas95/wikitext
Creator: Bas95
Published: 2026-04-29 22:15:43
License: No description provided

Source: Hugging Face. Updated: 2026-04-29. Indexed: 2026-05-03.

Download link:

https://hf-mirror.com/datasets/Bas95/wikitext

Description:

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies. Each subset comes in two variants. The Raw variant (for character-level work) contains the raw tokens, before the addition of the <unk> (unknown) tokens. The Non-raw variant (for word-level work) contains only the tokens in its vocabulary; out-of-vocabulary tokens have been replaced with the <unk> token.
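The difference between the Raw and Non-raw variants can be illustrated with a minimal sketch. It is not the dataset's actual preprocessing pipeline: the helper names `build_vocab` and `replace_oov`, the toy sentence, and the frequency threshold `min_count=2` are all assumptions chosen for illustration. The sketch builds a vocabulary from token counts and maps every token outside it to `<unk>`.

```python
from collections import Counter

UNK = "<unk>"  # placeholder token named in the description above


def build_vocab(tokens, min_count):
    """Keep tokens seen at least `min_count` times (threshold is an assumption)."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


def replace_oov(tokens, vocab):
    """Map every out-of-vocabulary token to <unk>, leaving the rest unchanged."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Toy "Raw" token stream (hypothetical data, not taken from WikiText).
raw_tokens = "the cat sat on the mat while the dog sat".split()

vocab = build_vocab(raw_tokens, min_count=2)
non_raw_tokens = replace_oov(raw_tokens, vocab)

print(" ".join(non_raw_tokens))
# prints: the <unk> sat <unk> the <unk> <unk> the <unk> sat
```

The sequence length is unchanged by the replacement, so a word-level model trained on the Non-raw text sees a closed vocabulary while a character-level model can still use the Raw text as is.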

Provider:

Bas95
