Bas95/wikitext

Source: Hugging Face. Updated 2026-04-29. Indexed 2026-05-03.
Download link:
https://hf-mirror.com/datasets/Bas95/wikitext
Description:
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation, and numbers. Because it is composed of full articles, the dataset is well suited to models that can take advantage of long-term dependencies.

Each subset comes in two variants. Raw (for character-level work) contains the raw tokens, before the addition of the <unk> (unknown) tokens. Non-raw (for word-level work) contains only the tokens in the vocabulary; out-of-vocabulary tokens have been replaced with the <unk> token.
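
The relationship between the Raw and Non-raw variants can be illustrated with a minimal Python sketch. It is not the dataset's actual preprocessing pipeline: the helper names `build_vocab` and `to_non_raw`, the toy sentence, and the `min_count` frequency cutoff are all assumptions chosen for illustration. The sketch only shows the general idea of keeping in-vocabulary tokens and replacing everything else with <unk>.

```python
from collections import Counter

UNK = "<unk>"  # placeholder token used by the Non-raw variant


def build_vocab(tokens, min_count):
    """Keep tokens seen at least min_count times (hypothetical cutoff)."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


def to_non_raw(tokens, vocab):
    """Replace every out-of-vocabulary token with the <unk> token."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Toy "raw" text; case and punctuation would be preserved in real data.
raw_tokens = "the cat sat on the mat near the cat".split()

vocab = build_vocab(raw_tokens, min_count=2)
non_raw_tokens = to_non_raw(raw_tokens, vocab)

print(" ".join(non_raw_tokens))
# prints: the cat <unk> <unk> the <unk> <unk> the cat
```

In this toy example the raw token sequence stays available for character-level work, while the mapped sequence has a closed vocabulary suitable for word-level models.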
Provider:
Bas95