defunct-datasets/the_pile_books3
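Dataset Overview
Basic Information
- Dataset name: Books3
- Language: English
- License: MIT
- Multilinguality: monolingual
- Size category: 100K<n<1M
- Source data: original
- Task categories: text generation, fill-mask
- Task IDs: language modeling, masked language modeling
- Viewer: no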
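The two task IDs differ in how training targets are derived from a book's text. The following is a minimal sketch, assuming plain whitespace tokenization and a hypothetical `[MASK]` placeholder; neither is specified by this card, and the helper names are illustrative only.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not defined by the dataset


def causal_pairs(tokens):
    """Language modeling: predict each token from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mask_tokens(tokens, rate=0.15, seed=0):
    """Masked language modeling: hide a random subset of tokens.

    Returns the masked sequence and a {position: original_token} dict.
    """
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            masked[i] = MASK
            targets[i] = tok
    return masked, targets


# Whitespace tokenization is an assumption made for illustration.
tokens = "the ninja believed they were special".split()
pairs = causal_pairs(tokens)
masked, targets = mask_tokens(tokens, rate=0.5)
```

In practice a subword tokenizer and fixed-length context windows would replace the whitespace split, but the shape of the targets stays the same.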
Dataset Structure
Features
- title: string
- text: string
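Since each record carries just two string features, it can be modeled as a small typed record. This is a sketch only: the `Book` class and `parse_record` helper are hypothetical names, not part of any loader for this dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    """One record: both features are plain strings."""
    title: str
    text: str


def parse_record(record: dict) -> Book:
    """Check that a raw dict has the two string features, then wrap it."""
    for key in ("title", "text"):
        if not isinstance(record.get(key), str):
            raise TypeError(f"feature {key!r} must be a string")
    return Book(title=record["title"], text=record["text"])


# Placeholder values, not taken from the dataset.
book = parse_record({"title": "Example Title", "text": "Example body text."})
```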
Configuration
- Configuration name: plain_text
Data Splits
- Train:
  - Bytes: 108392037000
  - Examples: 196639
Download and Dataset Size
- Download size: 39516981435 bytes
- Dataset size: 108392037000 bytes
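The raw byte counts are easier to read once converted. The sketch below uses only the figures listed in this card (with the train example count from the split metadata) to derive sizes in GiB, the ratio of dataset size to download size, and the average size of one book.

```python
NUM_BYTES = 108_392_037_000      # dataset size in bytes, from the card
DOWNLOAD_BYTES = 39_516_981_435  # download size in bytes, from the card
NUM_EXAMPLES = 196_639           # train examples, from the split metadata

dataset_gib = NUM_BYTES / 2**30
download_gib = DOWNLOAD_BYTES / 2**30
ratio = NUM_BYTES / DOWNLOAD_BYTES            # dataset size / download size
avg_bytes_per_book = NUM_BYTES / NUM_EXAMPLES

print(f"dataset size:   {dataset_gib:.1f} GiB")
print(f"download size:  {download_gib:.1f} GiB")
print(f"size ratio:     {ratio:.2f}x")
print(f"avg per book:   {avg_bytes_per_book / 1000:.0f} kB")
```

Roughly, the download is a bit over a third of the on-disk size, and an average book is on the order of half a megabyte of text.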
Data Instances
{title: 07 LEGO Ninjago - The Search For Zane (Scholastic) - Kate Howard (retail), text:
TITLE PAGE
FROM THE JOURNAL OF SENSEI GARMADON
CHAPTER 1
CHAPTER 2
CHAPTER 3
CHAPTER 4
CHAPTER 5
CHAPTER 6
CHAPTER 7
CHAPTER 8
CHAPTER 9
COPYRIGHT
Throughout Ninjago", five ninja are well-known for their speed, strength, and of course the elemental powers that help them protect our world from evil. But there are others who possess some of the same powers as the ninja. Others who may not always use their powers for good.
Before now, the ninja believed they were special. They di.......}
Data Fields
- title: the title of the book
- text: the text content of the book
Data Splits
| Split | Examples |
|---|---|
| train | 196640 |
Licensing Information
MIT
Citation Information
@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}




