
GuojiXu/openwebtext

Source: Hugging Face. Updated 2026-03-15. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/GuojiXu/openwebtext
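
As a minimal sketch of reading a few records from this repository, the snippet below uses the Hugging Face `datasets` library in streaming mode so that the full archive is not downloaded up front. It assumes the library is installed and that the repository ID matches the page title (`GuojiXu/openwebtext`), with the `plain_text` config and `train` split listed in the card. The helper names `open_stream` and `first_texts` are hypothetical.

```python
from itertools import islice
from typing import Iterable, List, Mapping


def open_stream(repo_id: str = "GuojiXu/openwebtext"):
    """Open the train split lazily; needs `pip install datasets` and network access."""
    # Imported here so that first_texts() below can be used without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, "plain_text", split="train", streaming=True)


def first_texts(records: Iterable[Mapping[str, str]], limit: int = 3) -> List[str]:
    """Return the `text` field of the first `limit` records."""
    return [record["text"] for record in islice(records, limit)]


# Offline stand-in with the same record shape as the dataset (one `text` string).
sample = [{"text": "first document"}, {"text": "second document"}]
print(first_texts(sample, limit=1))  # prints ['first document']
```

With network access, `first_texts(open_stream())` would fetch the first three documents from the hosted split.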
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license:
- cc0-1.0
multilinguality:
- monolingual
pretty_name: OpenWebText
size_categories:
- 1M<n<10M
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
task_ids:
- language-modeling
- masked-language-modeling
paperswithcode_id: openwebtext
dataset_info:
  config_name: plain_text
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 39769491688
    num_examples: 8013769
  download_size: 24193092408
  dataset_size: 39769491688
configs:
- config_name: plain_text
  data_files:
  - split: train
    path: plain_text/train-*
  default: true
---

# Dataset Card for "openwebtext"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://skylion007.github.io/OpenWebTextCorpus/](https://skylion007.github.io/OpenWebTextCorpus/)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 13.51 GB
- **Size of the generated dataset:** 41.70 GB
- **Total amount of disk used:** 55.21 GB

### Dataset Summary

An open-source replication of OpenAI's WebText dataset, which was used to train GPT-2.

This distribution was created by Aaron Gokaslan and Vanya Cohen of Brown University.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### plain_text

- **Size of downloaded dataset files:** 13.51 GB
- **Size of the generated dataset:** 41.70 GB
- **Total amount of disk used:** 55.21 GB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "text": "\"A magazine supplement with an image of Adolf Hitler and the title 'The Unreadable Book' is pictured in Berlin. No law bans “Mei..."
}
```

### Data Fields

The data fields are the same among all splits.

#### plain_text

- `text`: a `string` feature.

### Data Splits

| name       |   train |
|------------|--------:|
| plain_text | 8013769 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

The authors started by extracting all Reddit post URLs from the Reddit submissions dataset.
These links were deduplicated, filtered to exclude non-HTML content, and then shuffled randomly. The links were then distributed to several machines in parallel for download, and all web pages were extracted using the `newspaper` Python package. Non-English web pages were filtered out using Facebook FastText.

Subsequently, near-duplicate documents were identified using locality-sensitive hashing (LSH). Documents were hashed into sets of 5-grams, and documents whose similarity exceeded a threshold of 0.5 were removed. The remaining documents were tokenized, and documents with fewer than 128 tokens were removed. This left 38 GB of text data (40 GB using SI units) from 8,013,769 documents.

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

The dataset doesn't contain annotations.

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

These data are released under this licensing scheme from the original authors ([source](https://skylion007.github.io/OpenWebTextCorpus/)):

```
We do not own any of the text from which these data has been extracted.

We license the actual packaging of these parallel data under the [Creative Commons CC0 license (“no rights reserved”)](https://creativecommons.org/share-your-work/public-domain/cc0/)
```

#### Notice policy

Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

And contact us at the following email addresses: openwebtext at gmail.com and datasets at huggingface.co

#### Take down policy

The original authors will comply with legitimate requests by removing the affected sources from the next release of the corpus. Hugging Face will also update this repository accordingly.

### Citation Information

```
@misc{Gokaslan2019OpenWeb,
    title={OpenWebText Corpus},
    author={Gokaslan, Aaron and Cohen, Vanya and Pavlick, Ellie and Tellex, Stefanie},
    howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
    year={2019}
}
```

### Contributions

Thanks to [@richarddwang](https://github.com/richarddwang) for adding this dataset.
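
The filtering steps described under "Initial Data Collection and Normalization" can be illustrated with a small sketch. This is not the authors' pipeline: it swaps LSH for an exact pairwise Jaccard comparison (which LSH approximates at scale), assumes word-level 5-grams and whitespace tokenization, and keeps the first document of each near-duplicate group, none of which the card specifies. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 5) -> Set[Shingle]:
    """Set of word n-grams (assumption: whitespace tokens, word-level 5-grams)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Exact Jaccard similarity; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_corpus(docs: List[str], threshold: float = 0.5,
                  min_tokens: int = 128) -> List[str]:
    """Drop near-duplicates, then drop documents with fewer than `min_tokens` tokens."""
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc in docs:
        current = shingles(doc)
        # Assumption: a document is dropped if it is too similar to one already kept.
        if any(jaccard(current, seen) > threshold for seen in kept_shingles):
            continue
        kept.append(doc)
        kept_shingles.append(current)
    # Length filter runs after deduplication, in the order the card describes.
    return [doc for doc in kept if len(doc.split()) >= min_tokens]
```

The pairwise comparison is quadratic in the number of documents, which is why a corpus of this size would need an approximate method such as LSH instead.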
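
The card quotes the corpus size in two unit systems ("38GB ... 40GB using SI units"). As a rough check of how such figures relate, the sketch below converts the `num_bytes` and `num_examples` values from the metadata header into decimal gigabytes, binary gibibytes, and an average document size. The helper names are hypothetical, and the results need not match the rounded figures in the prose exactly.

```python
NUM_BYTES = 39_769_491_688   # num_bytes / dataset_size from the card metadata
NUM_EXAMPLES = 8_013_769     # num_examples from the card metadata


def to_gb(n_bytes: int) -> float:
    """Decimal (SI) gigabytes: 10**9 bytes."""
    return n_bytes / 10**9


def to_gib(n_bytes: int) -> float:
    """Binary gibibytes: 2**30 bytes."""
    return n_bytes / 2**30


def mean_bytes_per_document(n_bytes: int, n_docs: int) -> float:
    """Average stored size of one document in bytes."""
    return n_bytes / n_docs


print(f"{to_gb(NUM_BYTES):.2f} GB (SI)")
print(f"{to_gib(NUM_BYTES):.2f} GiB (binary)")
print(f"{mean_bytes_per_document(NUM_BYTES, NUM_EXAMPLES):.0f} bytes per document")
```

The gap of a few percent between the two unit systems is the same effect behind the two figures quoted in the card.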

Provider:
GuojiXu