Tej9515/wikitext

Name: Tej9515/wikitext
Creator: Tej9515
Published: 2025-12-16 16:04:00
License: No description provided

Source: Hugging Face. Updated: 2025-12-16. Indexed: 2025-12-20.

Download link:

https://hf-mirror.com/datasets/Tej9515/wikitext

Description:

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The dataset features a far larger vocabulary and retains the original case, punctuation, and numbers, making it well suited for models that can take advantage of long-term dependencies. Each subset comes in two variants: Raw (for character-level work) contains the raw tokens, while Non-raw (for word-level work) contains only the tokens in the vocabulary, with out-of-vocabulary tokens replaced by <unk>.
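
The difference between the Raw and Non-raw variants can be illustrated with a small sketch. It is not the dataset's actual preprocessing pipeline: the helper names `build_vocab` and `replace_oov`, the sample sentence, and the `min_count=2` frequency threshold are all assumptions chosen for illustration. Only the `<unk>` replacement and the preserved case come from the description above.

```python
from collections import Counter

UNK = "<unk>"  # placeholder token named in the dataset description


def build_vocab(tokens, min_count=2):
    """Keep tokens seen at least min_count times (illustrative threshold only)."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


def replace_oov(tokens, vocab):
    """Map every token outside the vocabulary to <unk>."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Hypothetical "raw" text; case and punctuation are kept as-is.
raw_text = "The cat sat on the mat . The cat slept ."
raw_tokens = raw_text.split()

vocab = build_vocab(raw_tokens, min_count=2)
word_level = replace_oov(raw_tokens, vocab)

print(" ".join(word_level))
```

Because case is preserved, `The` and `the` count as different tokens here, so the lowercase form falls below the threshold and becomes `<unk>`.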

Provided by:

Tej9515