
dat-hassan/wikitext2

Source: Hugging Face. Updated 2026-01-03; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/dat-hassan/wikitext2
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-sa-3.0
- gfdl
multilinguality:
- monolingual
paperswithcode_id: wikitext-2
pretty_name: WikiText
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
dataset_info:
- config_name: wikitext-103-v1
  features:
  - name: text
    dtype: string
  splits:
  - name: test
    num_bytes: 1295579
    num_examples: 4358
  - name: train
    num_bytes: 545142639
    num_examples: 1801350
  - name: validation
    num_bytes: 1154755
    num_examples: 3760
  download_size: 190229076
  dataset_size: 547592973
- config_name: wikitext-2-v1
  features:
  - name: text
    dtype: string
  splits:
  - name: test
    num_bytes: 1270951
    num_examples: 4358
  - name: train
    num_bytes: 10918134
    num_examples: 36718
  - name: validation
    num_bytes: 1134127
    num_examples: 3760
  download_size: 4475746
  dataset_size: 13323212
- config_name: wikitext-103-raw-v1
  features:
  - name: text
    dtype: string
  splits:
  - name: test
    num_bytes: 1305092
    num_examples: 4358
  - name: train
    num_bytes: 546501673
    num_examples: 1801350
  - name: validation
    num_bytes: 1159292
    num_examples: 3760
  download_size: 191984949
  dataset_size: 548966057
- config_name: wikitext-2-raw-v1
  features:
  - name: text
    dtype: string
  splits:
  - name: test
    num_bytes: 1305092
    num_examples: 4358
  - name: train
    num_bytes: 11061733
    num_examples: 36718
  - name: validation
    num_bytes: 1159292
    num_examples: 3760
  download_size: 4721645
  dataset_size: 13526117
---

# Dataset Card for "wikitext"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Pointer Sentinel Mixture Models](https://arxiv.org/abs/1609.07843)
- **Point of Contact:** [Stephen Merity](mailto:smerity@salesforce.com)
- **Size of downloaded dataset files:** 391.41 MB
- **Size of the generated dataset:** 1.12 GB
- **Total amount of disk used:** 1.52 GB

### Dataset Summary

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers, all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long-term dependencies.

Each subset comes in two different variants:

- Raw (for character-level work) contains the raw tokens, before the addition of the `<unk>` (unknown) tokens.
- Non-raw (for word-level work) contains only the tokens in the vocabulary (wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens). Out-of-vocabulary tokens have been replaced with the `<unk>` token.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### wikitext-103-raw-v1

- **Size of downloaded dataset files:** 191.98 MB
- **Size of the generated dataset:** 549.42 MB
- **Total amount of disk used:** 741.41 MB

An example of 'validation' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
}
```

#### wikitext-103-v1

- **Size of downloaded dataset files:** 190.23 MB
- **Size of the generated dataset:** 548.05 MB
- **Total amount of disk used:** 738.27 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
```

#### wikitext-2-raw-v1

- **Size of downloaded dataset files:** 4.72 MB
- **Size of the generated dataset:** 13.54 MB
- **Total amount of disk used:** 18.26 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
}
```

#### wikitext-2-v1

- **Size of downloaded dataset files:** 4.48 MB
- **Size of the generated dataset:** 13.34 MB
- **Total amount of disk used:** 17.82 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
```

### Data Fields

The data fields are the same among all splits.

#### wikitext-103-raw-v1

- `text`: a `string` feature.

#### wikitext-103-v1

- `text`: a `string` feature.

#### wikitext-2-raw-v1

- `text`: a `string` feature.

#### wikitext-2-v1

- `text`: a `string` feature.

### Data Splits

| name                |   train | validation | test |
|---------------------|--------:|-----------:|-----:|
| wikitext-103-raw-v1 | 1801350 |       3760 | 4358 |
| wikitext-103-v1     | 1801350 |       3760 | 4358 |
| wikitext-2-raw-v1   |   36718 |       3760 | 4358 |
| wikitext-2-v1       |   36718 |       3760 | 4358 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

The dataset is available under the [Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).

### Citation Information

```
@misc{merity2016pointer,
      title={Pointer Sentinel Mixture Models},
      author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
      year={2016},
      eprint={1609.07843},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham) for adding this dataset.
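
The Dataset Summary says the non-raw variants keep only in-vocabulary tokens and replace everything else with `<unk>`. The sketch below illustrates that relationship on a toy example. It is not the procedure used to build WikiText: the card does not say how the vocabulary was chosen, so the frequency threshold `min_count`, the helper names `build_vocab` and `replace_oov`, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter

UNK = "<unk>"  # unknown-token marker used by the non-raw variants


def build_vocab(train_tokens, min_count=3):
    """Return the set of tokens seen at least `min_count` times.

    `min_count` is a hypothetical threshold; the dataset card does not
    state how the real vocabulary was selected.
    """
    counts = Counter(train_tokens)
    return {tok for tok, n in counts.items() if n >= min_count}


def replace_oov(tokens, vocab, unk=UNK):
    """Map every token that is not in `vocab` to the unknown marker."""
    return [tok if tok in vocab else unk for tok in tokens]


# Toy whitespace-tokenized text standing in for a raw training split.
raw_train = "the cat sat on the mat . the cat saw the dog .".split()

# A low threshold so that the toy text keeps a non-empty vocabulary.
vocab = build_vocab(raw_train, min_count=2)
non_raw = replace_oov(raw_train, vocab)

print(" ".join(non_raw))
```

The vocabulary is built from the training tokens only and then applied to any split, which is the usual convention for word-level language modeling; whether this dataset followed exactly that convention is not stated in the card.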

Provider:
dat-hassan