defunct-datasets/the_pile_openwebtext2

Name: defunct-datasets/the_pile_openwebtext2
Creator: defunct-datasets
Published: 2023-11-27 14:54:23
License: No description available

Source: Hugging Face. Updated: 2023-11-27. Indexed: 2024-06-15.

Download link:

https://hf-mirror.com/datasets/defunct-datasets/the_pile_openwebtext2

Resource description:

---
annotations_creators:
- no-annotation
language_creators:
- found
language:
- en
license:
- mit
multilinguality:
- monolingual
pretty_name: OpenWebText2
size_categories:
- 10M<n<100M
source_datasets:
- original
task_categories:
- text-generation
- fill-mask
- text-classification
task_ids:
- language-modeling
- masked-language-modeling
- text-scoring
dataset_info:
  features:
  - name: title
    dtype: string
  - name: text
    dtype: string
  config_name: plain_text
  splits:
  - name: train
    num_bytes: 68571017395
    num_examples: 17103059
  download_size: 29344276480
  dataset_size: 68571017395
viewer: false
---

# Dataset Card for the_pile_openwebtext2

## Table of Contents

- [Dataset Card for the_pile_openwebtext2](#dataset-card-for-the_pile_openwebtext2)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://openwebtext2.readthedocs.io/en/latest/
- **Repository:** [GitHub](https://github.com/EleutherAI/openwebtext2)
- **Paper:** https://arxiv.org/abs/2101.00027
- **Leaderboard:** [Needs More Information]
- **Point of Contact:** [Needs More Information]

### Dataset Summary

<div class="course-tip course-tip-orange bg-gradient-to-br dark:bg-gradient-to-r before:border-orange-500 dark:before:border-orange-800 from-orange-50 dark:from-gray-900 to-white dark:to-gray-950 border border-orange-50 text-orange-700 dark:text-gray-400">
Defunct: Dataset "the_pile_openwebtext2" is defunct and no longer accessible due to unavailability of the source data.
</div>

OpenWebText2 is part of the EleutherAI/The Pile dataset and is an enhanced version of the original OpenWebTextCorpus. It covers all Reddit submissions from 2005 up until April 2020, with further months becoming available after the corresponding PushShift dump files are released.

|Field|Size|
|---|---|
|download_size|27.3 GiB|
|dataset_size|63.8 GiB|

### Supported Tasks and Leaderboards

This dataset is used for language modeling.

### Languages

This dataset is in English.

## Dataset Structure

### Data Instances

```
This example was too long and was cropped:

{'title': 'Xiaomi Mi Note 10 Gearbest Coupon Promo Code [6+128GB] [France Warehouse]',
 'text': '27% off Xiaomi Mi Note 10 (CC9 Pro) 108MP Penta Camera Mobile Phone Global Version Online Smartphone – Black Gearbest Coupon Promo Code\n\nGearbest Coupon Price :$439.99\n\nRegular Price : $603.19 Your Save : $163.20 Coupon Limit: 100 times Warehouse: France Expires : September 30, 2020 Coupon Valid for...',
 'reddit_scores': [6],}
```

### Data Fields

- `title`: title of the web page
- `text`: text content of the web page
- `reddit_scores`: scores of the Reddit submissions that mention this web page, as a list of integers

### Data Splits

|split|num examples|
|---|---|
|train|17103059|

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

```
@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}
```

### Contributions

[researcher2](https://github.com/researcher2) wrote much of this code, with inspiration and some straight copying of the scraping code found [here](https://github.com/yet-another-account/openwebtext/).

[sdtblck](https://github.com/sdtblck/) kindly put together the Colab notebook and performed a chunk of the scraping.

[leogao2](https://github.com/leogao2/) provided overall design guidance and lm_dataformat, and performed another chunk of the scraping.

[Colaboratory](https://colab.research.google.com/) VMs helped with about 10% of our overall scraping.

[The Eye](http://the-eye.eu/) hosts the processed datasets.

[Read The Docs](https://readthedocs.org/) hosts our documentation.

[@richarddwang](https://github.com/richarddwang) added this dataset to HF/datasets.

Provided by:

defunct-datasets

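Summary of original information

Dataset overview

Basic information

Dataset name: OpenWebText2
Language: English
License: MIT
Multilinguality: monolingual
Size category: 10M<n<100M
Source datasets: original
Task categories:
- text-generation
- fill-mask
- text-classification
Task IDs:
- language-modeling
- masked-language-modeling
- text-scoring

Data structure

Features:
- title: string
- text: string
Config name: plain_text
Splits:
- train:
  - Bytes: 68571017395
  - Examples: 17103059
Download size: 29344276480 bytes
Dataset size: 68571017395 bytes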
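
The byte counts above can be related to the "27.3 GiB" and "63.8 GiB" figures quoted in the dataset card with a little arithmetic. The following is a minimal sketch that uses only the numbers listed in this section; the variable and function names are illustrative and not part of any library.

```python
# Convert the byte counts listed above into GiB and derive two rough ratios.
# All names here are illustrative; only the three numbers come from the card.
GIB = 2 ** 30  # bytes per gibibyte

download_size = 29_344_276_480  # compressed download, in bytes
dataset_size = 68_571_017_395   # uncompressed train split, in bytes
num_examples = 17_103_059       # documents in the train split


def to_gib(num_bytes: int) -> float:
    """Return a byte count expressed in gibibytes."""
    return num_bytes / GIB


download_gib = to_gib(download_size)
dataset_gib = to_gib(dataset_size)
avg_bytes_per_example = dataset_size / num_examples
expansion_ratio = dataset_size / download_size

print(f"download size: {download_gib:.2f} GiB")
print(f"dataset size:  {dataset_gib:.2f} GiB")
print(f"average bytes per example: {avg_bytes_per_example:.0f}")
print(f"uncompressed / compressed: {expansion_ratio:.2f}")
```

The results land in the 27.3 to 27.4 GiB and 63.8 to 63.9 GiB ranges, which suggests the card's figures are truncated binary (GiB) values rather than decimal gigabytes.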

Data instance

{'title': 'Xiaomi Mi Note 10 Gearbest Coupon Promo Code [6+128GB] [France Warehouse]',
 'text': '27% off Xiaomi Mi Note 10 (CC9 Pro) 108MP Penta Camera Mobile Phone Global Version Online Smartphone – Black Gearbest Coupon Promo Code\n\nGearbest Coupon Price :$439.99\n\nRegular Price : $603.19 Your Save : $163.20 Coupon Limit: 100 times Warehouse: France Expires : September 30, 2020 Coupon Valid for...',
 'reddit_scores': [6]}

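Data fields

title: title of the web page
text: text content of the web page
reddit_scores: scores of the Reddit submissions that mention this web page, as a list of integers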
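
As a rough illustration of the record layout described by these fields, the sketch below types one record and filters on `reddit_scores`. Because the dataset is defunct, it does not attempt to load anything; the names `OpenWebText2Record` and `keep_record` and the score threshold are hypothetical, and the sample record is abbreviated from the example shown in this document.

```python
from typing import List, TypedDict


class OpenWebText2Record(TypedDict):
    """Hypothetical type mirroring the three documented fields."""
    title: str                # title of the web page
    text: str                 # text content of the web page
    reddit_scores: List[int]  # scores of Reddit submissions linking to the page


def keep_record(record: OpenWebText2Record, min_score: int = 3) -> bool:
    """Return True if any linking submission reached min_score.

    The default threshold is an arbitrary illustrative value.
    """
    return any(score >= min_score for score in record["reddit_scores"])


# Sample record abbreviated from the example in this document.
sample: OpenWebText2Record = {
    "title": "Xiaomi Mi Note 10 Gearbest Coupon Promo Code [6+128GB] [France Warehouse]",
    "text": "27% off Xiaomi Mi Note 10 (CC9 Pro) 108MP Penta Camera Mobile Phone ...",
    "reddit_scores": [6],
}

print(keep_record(sample))                # True: 6 >= 3
print(keep_record(sample, min_score=10))  # False: no score reaches 10
```

Treating `reddit_scores` as a list reflects the field description: one page can be linked from several submissions, each with its own score.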

Data splits

Split	Examples
train	17103059
