
SimpleBooks

Source: arXiv · Updated 2019-11-28 · Indexed 2024-06-21
Download link:
https://dldata-public.s3.us-east-2.amazonaws.com/simplebooks.zip
Description:
SimpleBooks is a small dataset designed for language model training, created by NVIDIA. It contains 92M word-level tokens with a vocabulary of 98K, far smaller than the vocabulary of the comparably sized WikiText-103. The data comes from 1,573 Project Gutenberg books, selected for their high ratio of word-level length to vocabulary size. The books were filtered and preprocessed, with their original capitalization and punctuation retained. SimpleBooks is intended for rapid experimentation and architecture search, addressing the cost and inefficiency of training models on large datasets.
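The selection criterion mentioned in the description (a high ratio of word-level length to vocabulary size) can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not NVIDIA's actual pipeline: the regex tokenizer and the function names `token_vocab_ratio` and `rank_books` are hypothetical, and the real preprocessing may differ.

```python
import re
from typing import Dict, List


def token_vocab_ratio(text: str) -> float:
    """Return (number of word-level tokens) / (number of distinct tokens)."""
    # Assumed tokenizer: runs of word characters, plus each punctuation mark
    # as its own token. Case is preserved, as in the dataset description.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if not tokens:
        return 0.0
    return len(tokens) / len(set(tokens))


def rank_books(books: Dict[str, str], keep: int) -> List[str]:
    """Return the titles of the `keep` books with the highest ratio."""
    ranked = sorted(
        books,
        key=lambda title: token_vocab_ratio(books[title]),
        reverse=True,
    )
    return ranked[:keep]
```

A book that reuses a small vocabulary many times scores high, which is what makes the resulting corpus "simple" for a word-level model: a comparable number of tokens, but far fewer rare words to learn.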
Provider:
NVIDIA
Created:
2019-11-28
Collected summary
Dataset introduction
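Background and challenges
Background overview
SimpleBooks is a small language model training dataset created by NVIDIA, with 92M word-level tokens and a 98K vocabulary. Its data comes from 1,573 Gutenberg books, filtered while retaining the original capitalization and punctuation. It is designed for rapid experimentation and architecture search, with the aim of lowering the cost and improving the efficiency of training compared with large datasets.
The content above was collected and summarized by 遇见数据集.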