fzkuji/pg19
Source: Hugging Face. Updated 2024-04-26; indexed 2024-06-12.
Download link: https://hf-mirror.com/datasets/fzkuji/pg19
---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
paperswithcode_id: pg-19
pretty_name: PG-19
dataset_info:
  features:
  - name: short_book_title
    dtype: string
  - name: publication_date
    dtype: int32
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 11453688452
    num_examples: 28602
  - name: validation
    num_bytes: 17402295
    num_examples: 50
  - name: test
    num_bytes: 40482852
    num_examples: 100
  download_size: 11740397875
  dataset_size: 11511573599
---
# Dataset Card for "pg19"
## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** [https://github.com/deepmind/pg19](https://github.com/deepmind/pg19)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 11.74 GB
- **Size of the generated dataset:** 11.51 GB
- **Total amount of disk used:** 23.25 GB
### Dataset Summary
This repository contains the PG-19 language modeling benchmark.
It includes a set of books, extracted from the Project Gutenberg library, that were published before 1919.
It also contains metadata of book titles and publication dates.
PG-19 is over double the size of the Billion Word benchmark and contains documents that are 20X longer, on average, than the WikiText long-range language modelling benchmark.
Books are partitioned into train, validation, and test sets. Book metadata is stored in metadata.csv, which contains (book_id, short_book_title, publication_date).
Unlike prior benchmarks, we do not constrain the vocabulary size (i.e. by mapping rare words to an UNK token) but instead release the data as an open-vocabulary benchmark. The only processing applied to the text is the removal of boilerplate license text and the mapping of offensive discriminatory words, as specified by Ofcom, to placeholder tokens. Users are free to model the data at the character level, at the subword level, or via any mechanism that can model an arbitrary string of text.
To compare models, we propose to continue measuring word-level perplexity, by calculating the total likelihood of the dataset (via any chosen subword vocabulary or character-based scheme) divided by the number of tokens, as given in the dataset statistics table of the original PG-19 repository.
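As a rough illustration of the word-level perplexity described above, the sketch below normalizes a summed log-likelihood by a word count and exponentiates. This is one common reading of the metric, not code from the PG-19 authors: the function name `word_level_perplexity`, the use of natural logarithms, and the toy probabilities are all assumptions made for the example.

```python
import math


def word_level_perplexity(total_log_likelihood: float, num_word_tokens: int) -> float:
    """Return exp(-LL / N), where LL is the summed natural-log probability
    the model assigned to the whole text (over subwords, characters, or any
    other unit) and N is the number of word-level tokens in that text."""
    if num_word_tokens <= 0:
        raise ValueError("num_word_tokens must be positive")
    return math.exp(-total_log_likelihood / num_word_tokens)


# Toy case (hypothetical numbers): the model scored three subword pieces
# that together spell two words.
subword_log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
total_ll = sum(subword_log_probs)

# Normalizing by the word count (2), not the subword count (3), keeps the
# figure comparable across models that tokenize differently.
ppl = word_level_perplexity(total_ll, num_word_tokens=2)
print(f"word-level perplexity: {ppl:.2f}")
```

Because the denominator is fixed by the dataset rather than by the tokenizer, two models with different vocabularies are scored against the same reference count.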
One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks which require long-range reasoning, such as LAMBADA or NarrativeQA. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.
### Supported Tasks and Leaderboards
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Languages
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Dataset Structure
### Data Instances
#### default
- **Size of downloaded dataset files:** 11.74 GB
- **Size of the generated dataset:** 11.51 GB
- **Total amount of disk used:** 23.25 GB
An example of 'train' looks as follows.
```
This example was too long and was cropped:
{
"publication_date": 1907,
"short_book_title": "La Fiammetta by Giovanni Boccaccio",
"text": "\"\\n\\n\\n\\nProduced by Ted Garvin, Dave Morgan and PG Distributed Proofreaders\\n\\n\\n\\n\\nLA FIAMMETTA\\n\\nBY\\n\\nGIOVANNI BOCCACCIO\\n...",
"url": "http://www.gutenberg.org/ebooks/10006"
}
```
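A record of this shape can be inspected with a few lines of Python. The sketch below is illustrative only: `summarize_record` and `stream_validation_records` are hypothetical helper names, the repository id `fzkuji/pg19` is taken from the page header, and the sample record is a shortened, unescaped copy of the cropped example above. The streaming helper assumes the Hugging Face `datasets` package is installed and is not called here, so the block runs offline.

```python
def summarize_record(record: dict) -> dict:
    """Collect a few cheap statistics about one book record."""
    text = record["text"]
    return {
        "title": record["short_book_title"],
        "year": record["publication_date"],
        "url": record["url"],
        "num_chars": len(text),
        # Whitespace splitting is only a rough proxy for a word count.
        "num_whitespace_words": len(text.split()),
    }


def stream_validation_records(repo_id: str = "fzkuji/pg19"):
    """Lazily iterate over the validation split without downloading the
    full dataset up front (assumes `datasets` is installed and the
    repository id is reachable)."""
    from datasets import load_dataset  # imported here so the sketch runs offline

    return load_dataset(repo_id, split="validation", streaming=True)


# Shortened stand-in for the cropped example shown above.
example = {
    "publication_date": 1907,
    "short_book_title": "La Fiammetta by Giovanni Boccaccio",
    "text": "\n\n\n\nProduced by Ted Garvin, Dave Morgan and PG Distributed "
            "Proofreaders\n\n\n\n\nLA FIAMMETTA\n\nBY\n\nGIOVANNI BOCCACCIO\n",
    "url": "http://www.gutenberg.org/ebooks/10006",
}

summary = summarize_record(example)
print(summary["title"], summary["year"], summary["num_whitespace_words"])
```

With network access, `for record in stream_validation_records(): ...` would yield dictionaries that can be passed to `summarize_record` one at a time.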
### Data Fields
The data fields are the same among all splits.
#### default
- `short_book_title`: a `string` feature.
- `publication_date`: an `int32` feature.
- `url`: a `string` feature.
- `text`: a `string` feature.
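The four fields listed above can be checked with a small validator before records are fed to a training or evaluation pipeline. This is a minimal sketch under stated assumptions: `EXPECTED_FIELDS` and `validate_record` are hypothetical names, and mapping `int32` to a Python `int` within the signed 32-bit range is an interpretation of the declared dtype rather than behaviour documented by the card.

```python
from typing import Dict, List

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

# Field names and dtypes as listed in the card; string -> str, int32 -> int.
EXPECTED_FIELDS: Dict[str, type] = {
    "short_book_title": str,
    "publication_date": int,
    "url": str,
    "text": str,
}


def validate_record(record: dict) -> List[str]:
    """Return a list of human-readable problems; empty means the record fits."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    year = record.get("publication_date")
    if isinstance(year, int) and not isinstance(year, bool):
        if not INT32_MIN <= year <= INT32_MAX:
            problems.append("publication_date: outside int32 range")
    return problems


good = {
    "short_book_title": "La Fiammetta by Giovanni Boccaccio",
    "publication_date": 1907,
    "url": "http://www.gutenberg.org/ebooks/10006",
    "text": "LA FIAMMETTA",
}
bad = {"short_book_title": "Untitled", "publication_date": "1907", "url": "n/a"}

print(validate_record(good))
print(validate_record(bad))
```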
### Data Splits
| name |train|validation|test|
|-------|----:|---------:|---:|
|default|28602| 50| 100|
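The split sizes can be cross-checked against the byte counts in the card's metadata header. The sketch below copies `num_bytes`, `num_examples`, and `dataset_size` from that header; the variable names and the derived per-book averages are illustrative additions, not figures published with the dataset.

```python
# Values copied from the `dataset_info` block of the card's metadata.
SPLITS = {
    "train": {"num_bytes": 11453688452, "num_examples": 28602},
    "validation": {"num_bytes": 17402295, "num_examples": 50},
    "test": {"num_bytes": 40482852, "num_examples": 100},
}
DATASET_SIZE = 11511573599  # `dataset_size` from the same block

total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
total_examples = sum(s["num_examples"] for s in SPLITS.values())

for name, s in SPLITS.items():
    share = 100 * s["num_examples"] / total_examples
    avg_kb = s["num_bytes"] / s["num_examples"] / 1e3
    print(f"{name:<10} {s['num_examples']:>6} books  {share:6.2f}%  ~{avg_kb:,.0f} kB per book")

# The per-split byte counts should add up to the reported generated size.
print("bytes match dataset_size:", total_bytes == DATASET_SIZE)
```

The validation and test splits together hold 150 of the 28,752 books, so nearly all of the data is available for training.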
## Dataset Creation
### Curation Rationale
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the source language producers?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Annotations
#### Annotation process
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
#### Who are the annotators?
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Personal and Sensitive Information
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Discussion of Biases
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Other Known Limitations
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
## Additional Information
### Dataset Curators
[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
### Licensing Information
The dataset is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).
### Citation Information
```
@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}
```
### Contributions
Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@lucidrains](https://github.com/lucidrains), [@lhoestq](https://github.com/lhoestq) for adding this dataset.
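Provider:
fzkuji
Summary of original information
Dataset overview
Basic information
- Name: PG-19
- Language: English
- License: Apache-2.0
- Multilinguality: monolingual
- Source data: original
- Task category: text generation
- Task ID: language modeling
Dataset size
- Download size: 11.74 GB
- Dataset size: 11.51 GB
Dataset structure
- Features:
  - short_book_title: string
  - publication_date: integer
  - url: string
  - text: string
- Data splits:
  - Train: 28602 examples
  - Validation: 50 examples
  - Test: 100 examples
Dataset creation
- Annotation creators: expert-generated
- Language creators: expert-generated
Usage considerations
- Not recommended for training a general-purpose language model, such as a production-system dialogue agent, because of the dated linguistic style of old texts and the biases inherent in historical writing.
Additional information
- Citation information: @article{raecompressive2019, author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P}, title = {Compressive Transformers for Long-Range Sequence Modelling}, journal = {arXiv preprint}, url = {https://arxiv.org/abs/1911.05507}, year = {2019}, }
- Contributors: @thomwolf, @lewtun, @lucidrains, @lhoestq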