SimpleBooks

arXiv · Updated 2019-11-28 · Indexed 2024-06-21

Download link:

https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip

Description:

SimpleBooks is a small dataset designed for language model training, created by NVIDIA. It contains 92M word-level tokens with a vocabulary of 98K, far smaller than the vocabulary of the comparably sized WikiText-103. The data comes from 1,573 Project Gutenberg books, selected for their high ratio of word-level book length to vocabulary size. The books were filtered and preprocessed, with their original capitalization and punctuation retained. SimpleBooks is intended for rapid experimentation and architecture search, where training on large datasets is costly and slow.
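The selection criterion described above (a high ratio of word count to vocabulary size) can be illustrated with a minimal sketch. It assumes plain whitespace tokenization with case preserved, which may differ from the preprocessing actually used for SimpleBooks; the function names `length_to_vocab_ratio` and `rank_books` are hypothetical and chosen only for this example.

```python
def length_to_vocab_ratio(text: str) -> float:
    """Return the number of word-level tokens divided by the number of distinct tokens."""
    # Assumption: whitespace tokenization, capitalization and punctuation left untouched.
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(tokens) / len(set(tokens))


def rank_books(books: dict[str, str]) -> list[tuple[str, float]]:
    """Sort books (title -> full text) from highest to lowest token-to-vocabulary ratio."""
    scored = [(title, length_to_vocab_ratio(body)) for title, body in books.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Books that reuse a small vocabulary many times score higher under this measure. Applied to the corpus-level figures quoted above, 92M tokens over a 98K vocabulary works out to roughly 940 occurrences per vocabulary entry on average.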

Provider:

NVIDIA

Created:

2019-11-28

Collection Summary

Dataset Overview