fzkuji/pg19

Name: fzkuji/pg19
Creator: fzkuji
Published: 2024-04-26 15:15:41
License: no description available

Source: Hugging Face. Updated: 2024-04-26. Indexed: 2024-06-12.

Download link:

https://hf-mirror.com/datasets/fzkuji/pg19


Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
paperswithcode_id: pg-19
pretty_name: PG-19
dataset_info:
  features:
  - name: short_book_title
    dtype: string
  - name: publication_date
    dtype: int32
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 11453688452
    num_examples: 28602
  - name: validation
    num_bytes: 17402295
    num_examples: 50
  - name: test
    num_bytes: 40482852
    num_examples: 100
  download_size: 11740397875
  dataset_size: 11511573599
---

# Dataset Card for "pg19"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/deepmind/pg19](https://github.com/deepmind/pg19)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 11.74 GB
- **Size of the generated dataset:** 11.51 GB
- **Total amount of disk used:** 23.25 GB

### Dataset Summary

This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Project Gutenberg books library, that were published before 1919. It also contains metadata of book titles and publication dates.

PG-19 is over double the size of the Billion Word benchmark and contains documents that are 20X longer, on average, than the WikiText long-range language modelling benchmark.

Books are partitioned into a train, validation, and test set. Book metadata is stored in metadata.csv which contains (book_id, short_book_title, publication_date).

Unlike prior benchmarks, we do not constrain the vocabulary size --- i.e. mapping rare words to an UNK token --- but instead release the data as an open-vocabulary benchmark. The only processing of the text that has been applied is the removal of boilerplate license text, and the mapping of offensive discriminatory words as specified by Ofcom to placeholder tokens. Users are free to model the data at the character-level, subword-level, or via any mechanism that can model an arbitrary string of text.

To compare models we propose to continue measuring the word-level perplexity, by calculating the total likelihood of the dataset (via any chosen subword vocabulary or character-based scheme) divided by the number of tokens --- specified below in the dataset statistics table.

One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks which require long-range reasoning, such as LAMBADA or NarrativeQA. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 11.74 GB
- **Size of the generated dataset:** 11.51 GB
- **Total amount of disk used:** 23.25 GB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "publication_date": 1907,
    "short_book_title": "La Fiammetta by Giovanni Boccaccio",
    "text": "\"\\n\\n\\n\\nProduced by Ted Garvin, Dave Morgan and PG Distributed Proofreaders\\n\\n\\n\\n\\nLA FIAMMETTA\\n\\nBY\\n\\nGIOVANNI BOCCACCIO\\n...",
    "url": "http://www.gutenberg.org/ebooks/10006"
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `short_book_title`: a `string` feature.
- `publication_date`: a `int32` feature.
- `url`: a `string` feature.
- `text`: a `string` feature.

### Data Splits

| name  |train|validation|test|
|-------|----:|---------:|---:|
|default|28602|        50| 100|

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

The dataset is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

### Citation Information

```
@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
          Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@lucidrains](https://github.com/lucidrains), [@lhoestq](https://github.com/lhoestq) for adding this dataset.
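
The word-level perplexity proposed in the Dataset Summary can be illustrated with a minimal Python sketch. It assumes the standard definition, the exponential of the negative total log-likelihood divided by the number of word-level tokens, and that the model reports natural-log probabilities for its own tokens (subwords or characters). The function name, argument names, and toy numbers are hypothetical and do not come from the dataset card.

```python
import math


def word_level_perplexity(token_log_probs, num_words):
    """Word-level perplexity from per-token natural-log probabilities.

    token_log_probs: log p(token | context) for every model token
        (subword or character) in the evaluation text.
    num_words: number of word-level tokens in the same text. How words
        are counted is an assumption of this sketch; comparable results
        need the benchmark's own published token counts.
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    total_log_likelihood = sum(token_log_probs)
    # Normalise by the word count, not by the number of model tokens,
    # so that models with different tokenisations stay comparable.
    return math.exp(-total_log_likelihood / num_words)


# Toy input: 6 subword tokens, each with probability 0.5, covering 3 words.
toy_log_probs = [math.log(0.5)] * 6
toy_ppl = word_level_perplexity(toy_log_probs, num_words=3)
print(toy_ppl)
```

Because the denominator is the word count rather than the model's own token count, a character-level model and a subword model evaluated on the same text are scored on the same scale.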


Provider:

fzkuji

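Summary of original information

Dataset overview

Basic information

Name: PG-19
Language: English
License: Apache-2.0
Multilinguality: monolingual
Source data: original
Task category: text generation
Task ID: language modeling

Dataset size

Download size: 11.74 GB
Dataset size: 11.51 GB

Dataset structure

Features:
- short_book_title: string
- publication_date: integer (int32)
- url: string
- text: string
Data splits:
- Train: 28,602 examples
- Validation: 50 examples
- Test: 100 examples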
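
As a rough illustration of this structure, the Python sketch below mirrors the four fields in a small record type and cross-checks the split statistics. The byte and example counts are copied from the `dataset_info` metadata in the dataset card; the names `PG19Record`, `parse_record`, `SPLITS`, and `mean_bytes_per_book` are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PG19Record:
    """One book, mirroring the four listed fields (hypothetical helper)."""
    short_book_title: str
    publication_date: int  # stored as int32 in the dataset schema
    url: str
    text: str


def parse_record(raw: dict) -> PG19Record:
    """Check field names and types, then build a PG19Record."""
    expected = {
        "short_book_title": str,
        "publication_date": int,
        "url": str,
        "text": str,
    }
    for name, typ in expected.items():
        if name not in raw:
            raise KeyError(f"missing field: {name}")
        if not isinstance(raw[name], typ):
            raise TypeError(f"{name} should be of type {typ.__name__}")
    return PG19Record(**{name: raw[name] for name in expected})


# Split statistics from the dataset card: (num_bytes, num_examples).
SPLITS = {
    "train": (11453688452, 28602),
    "validation": (17402295, 50),
    "test": (40482852, 100),
}


def mean_bytes_per_book(split: str) -> float:
    """Average size of one book in the given split, in bytes."""
    num_bytes, num_examples = SPLITS[split]
    return num_bytes / num_examples


# The cropped training example shown in the dataset card.
example = parse_record({
    "publication_date": 1907,
    "short_book_title": "La Fiammetta by Giovanni Boccaccio",
    "text": "...",  # cropped in the card as well
    "url": "http://www.gutenberg.org/ebooks/10006",
})

total_bytes = sum(num_bytes for num_bytes, _ in SPLITS.values())
total_books = sum(num_examples for _, num_examples in SPLITS.values())
print(total_books, total_bytes, round(mean_bytes_per_book("train")))
```

The per-split byte counts add up to the `dataset_size` value given in the card (11511573599), which is a quick consistency check on the metadata.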

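Dataset creation

Annotation creators: expert-generated
Language creators: expert-generated

Usage considerations

Not recommended for training a general-purpose language model, such as a production-system dialogue agent, because of the dated linguistic style of old texts and the biases inherent in historical writing.

Additional information

Citation information:

@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}
Contributors: @thomwolf, @lewtun, @lucidrains, @lhoestq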


The content above was collected and summarized by 遇见数据集.
