SimpleBooks
Source: arXiv · Updated: 2019-11-28 · Indexed: 2024-06-21
Download link:
https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
Overview:
SimpleBooks is a small dataset created by NVIDIA for training language models. It contains 92M word-level tokens with a vocabulary of 98K, a much smaller vocabulary than that of WikiText-103, a corpus of comparable size. The text comes from 1,573 Project Gutenberg books, selected for their high ratio of total word count to vocabulary size. The books were filtered and preprocessed, with original capitalization and punctuation retained. SimpleBooks is intended for rapid experimentation and architecture search, where training on large datasets is costly and slow.
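The selection criterion described above (a high ratio of word count to vocabulary size) can be illustrated with a minimal sketch. It assumes whitespace-separated word-level tokens with case and punctuation kept as-is; the function name `corpus_stats` and the sample lines are hypothetical and are not taken from the dataset or its tooling.

```python
from collections import Counter
from typing import Iterable


def corpus_stats(lines: Iterable[str]) -> dict:
    """Count word-level tokens and distinct word types in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Assumption: tokens are whitespace-separated; case and punctuation are kept.
        counts.update(line.split())
    n_tokens = sum(counts.values())
    n_types = len(counts)
    # Tokens per distinct word type; higher means each word is seen more often.
    ratio = n_tokens / n_types if n_types else 0.0
    return {"tokens": n_tokens, "vocab": n_types, "tokens_per_type": ratio}


# Hypothetical two-line sample, not real SimpleBooks text.
sample = ["The cat sat .", "The dog sat ."]
stats = corpus_stats(sample)
print(stats)
```

Applied to the rounded figures quoted for SimpleBooks, the same ratio is roughly 92,000,000 / 98,000 ≈ 939 tokens per vocabulary entry.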
Provider:
NVIDIA
Created:
2019-11-28
Collected summary
Dataset introduction

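Background and challenges
Background overview
SimpleBooks is a small language-model training dataset created by NVIDIA, containing 92M word-level tokens and a 98K-word vocabulary. The text comes from 1,573 Project Gutenberg books, filtered with original capitalization and punctuation retained. It is designed for rapid experimentation and architecture search, with the aim of lowering the cost and improving the efficiency of training compared with large datasets.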



