deepmind/pg19 | Language modeling dataset | Long-range sequence modeling dataset

Source: Hugging Face. Updated 2024-01-18; indexed 2024-05-25.

Language modeling

Long-range sequence modeling

Download link:

https://hf-mirror.com/datasets/deepmind/pg19


Resource overview:

---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- text-generation
task_ids:
- language-modeling
paperswithcode_id: pg-19
pretty_name: PG-19
dataset_info:
  features:
  - name: short_book_title
    dtype: string
  - name: publication_date
    dtype: int32
  - name: url
    dtype: string
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 11453688452
    num_examples: 28602
  - name: validation
    num_bytes: 17402295
    num_examples: 50
  - name: test
    num_bytes: 40482852
    num_examples: 100
  download_size: 11740397875
  dataset_size: 11511573599
---

# Dataset Card for "pg19"

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://github.com/deepmind/pg19](https://github.com/deepmind/pg19)
- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Paper:** [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
- **Size of downloaded dataset files:** 11.74 GB
- **Size of the generated dataset:** 11.51 GB
- **Total amount of disk used:** 23.25 GB

### Dataset Summary

This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Project Gutenberg books library that were published before 1919. It also contains metadata of book titles and publication dates.

PG-19 is over double the size of the Billion Word benchmark and contains documents that are 20X longer, on average, than the WikiText long-range language modelling benchmark.

Books are partitioned into a train, validation, and test set. Book metadata is stored in metadata.csv, which contains (book_id, short_book_title, publication_date).

Unlike prior benchmarks, we do not constrain the vocabulary size (i.e. mapping rare words to an UNK token) but instead release the data as an open-vocabulary benchmark. The only processing of the text that has been applied is the removal of boilerplate license text, and the mapping of offensive discriminatory words as specified by Ofcom to placeholder tokens. Users are free to model the data at the character-level, subword-level, or via any mechanism that can model an arbitrary string of text.

To compare models we propose to continue measuring the word-level perplexity, by calculating the total likelihood of the dataset (via any chosen subword vocabulary or character-based scheme) divided by the number of tokens, specified below in the dataset statistics table.
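The word-level perplexity described above can be sketched in a few lines: sum the model's negative log-likelihood over whatever units it predicts (characters, subwords, or words), then normalize by the number of words rather than by the number of model tokens. This is a minimal illustration, not the authors' evaluation code. The function name and all numbers below are hypothetical, and the real word count should come from the dataset statistics table the card refers to.

```python
import math


def word_level_perplexity(total_nll_nats: float, num_words: int) -> float:
    """Perplexity per word, given the total negative log-likelihood in nats.

    total_nll_nats is the sum of -log p(token) over every token the model
    scored, regardless of how the model tokenizes the text.
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    return math.exp(total_nll_nats / num_words)


# Hypothetical numbers: a subword model scored 1,300 subword tokens
# that together cover 1,000 words of text.
subword_nll = 3900.0
num_subwords = 1300
num_words = 1000

# Normalizing by the model's own token count gives a number that is not
# comparable across tokenizers.
subword_ppl = math.exp(subword_nll / num_subwords)

# Normalizing by the word count gives the comparable word-level figure.
word_ppl = word_level_perplexity(subword_nll, num_words)

print(f"subword-level perplexity: {subword_ppl:.2f}")
print(f"word-level perplexity:    {word_ppl:.2f}")
```

Because the same total likelihood is spread over fewer units, the word-level figure is higher than the subword-level one whenever a word is split into more than one token on average.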
One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks which require long-range reasoning, such as LAMBADA or NarrativeQA. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.

### Supported Tasks and Leaderboards

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Languages

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Dataset Structure

### Data Instances

#### default

- **Size of downloaded dataset files:** 11.74 GB
- **Size of the generated dataset:** 11.51 GB
- **Total amount of disk used:** 23.25 GB

An example of 'train' looks as follows.

```
This example was too long and was cropped:

{
    "publication_date": 1907,
    "short_book_title": "La Fiammetta by Giovanni Boccaccio",
    "text": "\"\\n\\n\\n\\nProduced by Ted Garvin, Dave Morgan and PG Distributed Proofreaders\\n\\n\\n\\n\\nLA FIAMMETTA\\n\\nBY\\n\\nGIOVANNI BOCCACCIO\\n...",
    "url": "http://www.gutenberg.org/ebooks/10006"
}
```

### Data Fields

The data fields are the same among all splits.

#### default

- `short_book_title`: a `string` feature.
- `publication_date`: an `int32` feature.
- `url`: a `string` feature.
- `text`: a `string` feature.
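The four fields listed above can be expressed as a typed record with a small validity check. This is only a sketch: `PG19Record`, `EXPECTED_TYPES`, and `validate_record` are hypothetical names introduced for illustration, and the example values are copied from the cropped instance shown in the card, with the text shortened further.

```python
from typing import TypedDict


class PG19Record(TypedDict):
    """One book, using the field names from the dataset card."""
    short_book_title: str
    publication_date: int  # stored as int32 in the dataset
    url: str
    text: str


# Python type expected for each field.
EXPECTED_TYPES = {
    "short_book_title": str,
    "publication_date": int,
    "url": str,
    "text": str,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in a record; an empty list means none."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            found = type(record[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {found}")
    return problems


example: PG19Record = {
    "publication_date": 1907,
    "short_book_title": "La Fiammetta by Giovanni Boccaccio",
    "text": "LA FIAMMETTA\n\nBY\n\nGIOVANNI BOCCACCIO\n",
    "url": "http://www.gutenberg.org/ebooks/10006",
}

print(validate_record(example))
print(validate_record({"short_book_title": "Untitled", "publication_date": "1907"}))
```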
### Data Splits

| name    | train | validation | test |
|---------|------:|-----------:|-----:|
| default | 28602 |         50 |  100 |

## Dataset Creation

### Curation Rationale

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the source language producers?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Annotations

#### Annotation process

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

#### Who are the annotators?

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Personal and Sensitive Information

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Discussion of Biases

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Other Known Limitations

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

## Additional Information

### Dataset Curators

[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)

### Licensing Information

The dataset is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

### Citation Information

```
@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
          Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}
```

### Contributions

Thanks to [@thomwolf](https://github.com/thomwolf), [@lewtun](https://github.com/lewtun), [@lucidrains](https://github.com/lucidrains), [@lhoestq](https://github.com/lhoestq) for adding this dataset.

Provider:

deepmind

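Summary of the original information

Dataset overview

Dataset name: PG-19

Dataset description: PG-19 is a language modeling benchmark consisting of books extracted from the Project Gutenberg library that were published before 1919. It also includes metadata such as book titles and publication dates. PG-19 is more than twice the size of the Billion Word benchmark, and its documents are on average 20 times longer than those in the WikiText long-range language modeling benchmark.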

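Dataset characteristics:

Language: English (en)
License: Apache-2.0
Multilinguality: monolingual
Size category: 10K<n<100K
Source datasets: original
Task category: text generation (text-generation)
Task ID: language modeling (language-modeling)
Dataset information:
- Features:
  - short_book_title: string
  - publication_date: integer (int32)
  - url: string
  - text: string
- Splits:
  - Training set (train): 28,602 examples, 11,453,688,452 bytes
  - Validation set (validation): 50 examples, 17,402,295 bytes
  - Test set (test): 100 examples, 40,482,852 bytes
- Download size: 11.74 GB
- Dataset size: 11.51 GB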
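As a quick consistency check on the figures above, the per-split byte counts can be summed and compared with the reported sizes. The byte values are taken from the dataset card metadata (`num_bytes`, `dataset_size`, `download_size`). The helper name `to_gb` is hypothetical, and the conversion assumes decimal gigabytes (1 GB = 10^9 bytes), which is what reproduces the rounded figures on this page.

```python
# Byte counts from the dataset card metadata.
split_bytes = {
    "train": 11_453_688_452,
    "validation": 17_402_295,
    "test": 40_482_852,
}
dataset_size = 11_511_573_599
download_size = 11_740_397_875


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (assumption: 1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


total_split_bytes = sum(split_bytes.values())
train_share = split_bytes["train"] / total_split_bytes

print("splits sum to dataset_size:", total_split_bytes == dataset_size)
print(f"download size:   {to_gb(download_size):.2f} GB")
print(f"dataset size:    {to_gb(dataset_size):.2f} GB")
print(f"total disk used: {to_gb(download_size + dataset_size):.2f} GB")
print(f"train share of bytes: {train_share:.1%}")
```

The validation and test sets are tiny by comparison: almost all of the bytes sit in the training split.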

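Intended use: The dataset is suited to benchmarking long-range language models, or to pre-training for other natural language processing tasks that require long-range reasoning, such as LAMBADA or NarrativeQA. It is not recommended for training general-purpose language models, such as a production-system dialogue agent, because of the dated linguistic style of old texts and the biases inherent in historical writing.

License: The dataset is licensed under the Apache License, Version 2.0.

Citation: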

@article{raecompressive2019, author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and Hillier, Chloe and Lillicrap, Timothy P}, title = {Compressive Transformers for Long-Range Sequence Modelling}, journal = {arXiv preprint}, url = {https://arxiv.org/abs/1911.05507}, year = {2019}, }

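AI-compiled summary

Dataset introduction

Construction

PG-19 is built from books in the Project Gutenberg library that were published before 1919. Boilerplate license text was removed and offensive discriminatory words were mapped to placeholder tokens; otherwise the original text of each book is kept. Each book also carries a short title, a publication date, and a URL as metadata. PG-19 is more than twice the size of the Billion Word benchmark, and its documents are on average 20 times longer than those in the WikiText benchmark. The books are partitioned into training, validation, and test sets.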

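Characteristics

PG-19 stands out for its scale and for the length of its documents, which make it well suited to training and evaluating long-range language models. It is an open-vocabulary benchmark: rare words are not mapped to an unknown-word token, so users can model the text at the character level, at the subword level, or with any mechanism that can model an arbitrary string. The texts reflect the language of their historical period. That can introduce bias, but it also makes the dataset a distinctive resource for studying historical language style and long-range reasoning.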
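To make the open-vocabulary point concrete, the sketch below contrasts a closed word vocabulary, which has to replace unseen words with an unknown-word token, with byte-level encoding, which can represent any string and decode it back exactly. The tiny vocabulary and the function name are hypothetical; the sample string comes from the example record in the dataset card.

```python
def encode_with_closed_vocab(words, vocab, unk="<unk>"):
    """Replace every word outside the vocabulary with the unknown token."""
    return [word if word in vocab else unk for word in words]


text = "LA FIAMMETTA BY GIOVANNI BOCCACCIO"
words = text.split()

# Hypothetical closed vocabulary that only knows two of the five words.
vocab = {"LA", "BY"}
closed = encode_with_closed_vocab(words, vocab)

# Byte-level encoding needs no unknown token: every string maps to
# integers in the range 0..255 and can be decoded back without loss.
byte_ids = list(text.encode("utf-8"))
decoded = bytes(byte_ids).decode("utf-8")

print(closed)
print(len(byte_ids), "byte tokens; lossless round trip:", decoded == text)
```

The closed vocabulary loses the three rare words, while the byte-level encoding keeps everything at the cost of a much longer sequence, which is one reason long-range modeling matters for this kind of data.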

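Usage

PG-19 suits a range of natural language processing tasks, especially those that require long-range reasoning, such as LAMBADA and NarrativeQA. The dataset can be loaded with the Hugging Face datasets library: use the training set for pre-training, the validation set for tuning, and the test set for final evaluation. Because the texts are written in the style of their historical period, the dataset is not recommended for training general-purpose language models, particularly those aimed at modern dialogue systems.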
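Loading the data with the `datasets` library might look like the sketch below. The dataset identifier `deepmind/pg19` is taken from the download link on this page; `load_pg19` and `preview` are hypothetical helper names. Streaming is used on the assumption that you want to look at a few books without downloading the full 11.74 GB. Depending on the installed `datasets` version, loading this dataset may need additional arguments or a different revision, so treat the call as a starting point rather than a guaranteed recipe.

```python
from itertools import islice


def load_pg19(split: str = "validation", streaming: bool = True):
    """Open one split of PG-19 (assumes the `datasets` package and network access)."""
    # Imported lazily so that preview() can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("deepmind/pg19", split=split, streaming=streaming)


def preview(records, n: int = 3, chars: int = 80):
    """Return (title, year, first characters of text) for the first n records."""
    rows = []
    for record in islice(records, n):
        rows.append(
            (
                record["short_book_title"],
                record["publication_date"],
                record["text"][:chars],
            )
        )
    return rows


# Usage (needs network access and the `datasets` package):
#
#     for title, year, snippet in preview(load_pg19("validation")):
#         print(year, title, repr(snippet))
```

`preview` only relies on the field names from the dataset card, so it works on any iterable of dictionaries with those keys, which makes it easy to try out offline.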

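Background and challenges

Background

PG-19 was released by DeepMind in 2019 as a challenging benchmark for long-sequence language modeling. It draws on Project Gutenberg books published before 1919. PG-19 is more than twice the size of the Billion Word benchmark, its documents are on average 20 times longer than those in WikiText, and it is released as an open-vocabulary benchmark. Its central research question is how to model long-range dependencies effectively, which matters for tasks such as LAMBADA and NarrativeQA.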

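Current challenges

Working with PG-19 raises several challenges. First, book-length sequences require models that can capture long-range dependencies efficiently, which strains both model architectures and compute budgets. Second, the style and vocabulary of historical texts differ markedly from modern language, which can limit how well a model generalizes. Third, the open vocabulary adds flexibility but also brings sparsity: many word forms are rare. Finally, bias and sensitive content in historical texts need careful handling so that they do not carry over into trained models.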
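The first challenge above is often approached by feeding a book to the model as a sequence of fixed-length segments in reading order. The sketch below shows only that segmentation step, with a hypothetical helper name and toy token IDs. It is a generic illustration, not the method of the PG-19 paper or of any particular model.

```python
from typing import Iterator, Sequence


def iter_segments(tokens: Sequence[int], segment_len: int) -> Iterator[Sequence[int]]:
    """Yield consecutive, non-overlapping segments of at most segment_len tokens."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    for start in range(0, len(tokens), segment_len):
        yield tokens[start:start + segment_len]


# Toy stand-in for a tokenized book: ten token IDs split into segments of four.
book_tokens = list(range(10))
segments = list(iter_segments(book_tokens, 4))

for index, segment in enumerate(segments):
    print(index, segment)
```

The last segment may be shorter than the others. A model that keeps some memory across segments can still use earlier context, whereas one that treats each segment independently sees at most `segment_len` tokens at a time.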

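Common scenarios

Typical use cases

PG-19 is typically used to train and evaluate long-sequence language models. Because its documents are far longer than those in earlier benchmarks such as WikiText, it is particularly relevant to tasks that depend on long-range context, such as narrative comprehension (NarrativeQA) and broad-context word prediction (LAMBADA). It lets researchers develop and validate models that can handle long, complex texts.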

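Academic problems addressed

PG-19 addresses the difficulty conventional language models have with long texts, in particular with long-range dependencies. By providing a large collection of book-length documents, it gives researchers a platform for developing and testing models that capture long-range context, and it serves as a benchmark for long-sequence modeling.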

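Related work

PG-19 is closely tied to research on long-sequence language models. The Compressive Transformer, a model architecture designed to handle long-range dependencies, was introduced in the same paper that released PG-19. The dataset has also been used in research on other long-text tasks, such as narrative comprehension and long-text generation.

The content above was collected and summarized by AI.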
