wikitext_document_level

Name: wikitext_document_level
Creator: maas
Published: 2025-12-05 16:46:21
License: No description provided

ModelScope community · Updated 2025-12-05 · Indexed 2025-08-16

Download link:

https://modelscope.cn/datasets/EleutherAI/wikitext_document_level


Description:

# Wikitext Document Level

This is a modified version of [https://huggingface.co/datasets/wikitext](https://huggingface.co/datasets/wikitext) that returns Wiki pages instead of Wiki text line-by-line. The original readme is contained below.

# Dataset Card for "wikitext"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/](https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Pointer Sentinel Mixture Models](https://arxiv.org/abs/1609.07843)
- **Point of Contact:** [Stephen Merity](mailto:smerity@salesforce.com)
- **Size of downloaded dataset files:** 373.28 MB
- **Size of the generated dataset:** 1072.25 MB
- **Total amount of disk used:** 1445.53 MB

### Dataset Summary

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### wikitext-103-raw-v1

- **Size of downloaded dataset files:** 183.09 MB
- **Size of the generated dataset:** 523.97 MB
- **Total amount of disk used:** 707.06 MB

An example of 'validation' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\" The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from..."
}
```

#### wikitext-103-v1

- **Size of downloaded dataset files:** 181.42 MB
- **Size of the generated dataset:** 522.66 MB
- **Total amount of disk used:** 704.07 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
```

#### wikitext-2-raw-v1

- **Size of downloaded dataset files:** 4.50 MB
- **Size of the generated dataset:** 12.91 MB
- **Total amount of disk used:** 17.41 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\" The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than t..."
}
```

#### wikitext-2-v1

- **Size of downloaded dataset files:** 4.27 MB
- **Size of the generated dataset:** 12.72 MB
- **Total amount of disk used:** 16.99 MB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\" Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to..."
}
```

### Data Fields

The data fields are the same among all splits.

#### wikitext-103-raw-v1

- `text`: a `string` feature.

#### wikitext-103-v1

- `text`: a `string` feature.

#### wikitext-2-raw-v1

- `text`: a `string` feature.

#### wikitext-2-v1

- `text`: a `string` feature.

### Data Splits

| name                |   train | validation | test |
|---------------------|--------:|-----------:|-----:|
| wikitext-103-raw-v1 | 1801350 |       3760 | 4358 |
| wikitext-103-v1     | 1801350 |       3760 | 4358 |
| wikitext-2-raw-v1   |   36718 |       3760 | 4358 |
| wikitext-2-v1       |   36718 |       3760 | 4358 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

The dataset is available under the [Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).

### Citation Information

```
@misc{merity2016pointer,
      title={Pointer Sentinel Mixture Models},
      author={Stephen Merity and Caiming Xiong and James Bradbury and Richard Socher},
      year={2016},
      eprint={1609.07843},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@patrickvonplaten](https://github.com/patrickvonplaten), [@mariamabarham](https://github.com/mariamabarham) for adding this dataset.
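
The card describes this variant as returning Wiki pages rather than line-by-line text, but it does not say how the pages are assembled. The following is a minimal sketch of one way line-level records could be grouped into pages. It is not the dataset authors' method. It assumes that a page title appears on its own line in the form ` = Title = ` while section headings use doubled markers such as ` = = Section = = `; the function name `group_lines_into_pages` and the sample lines are hypothetical.

```python
import re

# Assumption: a top-level page title looks like " = Title = ", whereas
# section headings start with " = = " and therefore do not match.
TITLE_RE = re.compile(r"^ = [^=].* = $")


def group_lines_into_pages(lines):
    """Group line-level records into one string per page (heuristic sketch)."""
    pages = []
    current = []
    for line in lines:
        # A title line closes the page collected so far and starts a new one.
        if TITLE_RE.match(line.rstrip("\n")) and current:
            pages.append("".join(current))
            current = []
        current.append(line)
    if current:
        pages.append("".join(current))
    return pages


# Hypothetical line-level input in the tokenized WikiText style.
sample_lines = [
    " = Page One = \n",
    " \n",
    " Body text of page one . \n",
    " = = Section = = \n",
    " More text . \n",
    " = Page Two = \n",
    " Second body . \n",
]

pages = group_lines_into_pages(sample_lines)
print(len(pages))
```

A body line that happens to match the title pattern would be split incorrectly, so this heuristic is only a starting point. In practice the input lines would come from the `text` field described under Data Fields.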


Provider:

maas

Created:

2025-08-15
