
defunct-datasets/the_pile_books3

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/defunct-datasets/the_pile_books3
Resource summary:

This dataset, named the_pile_books3, is a work by Shawn Presser and is part of the EleutherAI/The Pile dataset. It contains plaintext versions of 197,000 books, processed using the same pipeline as bookcorpusopen (also known as books1). The dataset is primarily intended for language modeling tasks, with all text in English. Its structure includes two fields: book title and text content, and the training split contains 196,640 samples. Due to copyright issues, this dataset is no longer accessible.
Provider:
defunct-datasets
Original information summary

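Dataset overview

Basic information

  • Dataset name: Books3
  • Language: English
  • License: MIT
  • Multilinguality: monolingual
  • Size category: 100K<n<1M
  • Source data: original
  • Task categories: text generation, fill-mask
  • Task IDs: language modeling, masked language modeling
  • Dataset viewer: no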

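Dataset structure

Features

  • title: string
  • text: string

Configuration

  • Configuration name: plain_text

Data splits

  • Train:
    • Bytes: 108392037000
    • Samples: 196639

Download and dataset size

  • Download size: 39516981435 bytes
  • Dataset size: 108392037000 bytes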
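
To put the two byte counts above into more familiar units, here is a minimal Python sketch. It only does arithmetic on the numbers listed. The helper names `to_gb` and `to_gib` are assumptions introduced for illustration and are not part of the dataset card or any library.

```python
# Byte counts copied from the size figures listed above.
DOWNLOAD_BYTES = 39_516_981_435    # size of the download
DATASET_BYTES = 108_392_037_000    # size of the dataset (train split)


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 10**9


def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gibibytes (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


# How many times larger the dataset is than the download.
expansion_ratio = DATASET_BYTES / DOWNLOAD_BYTES

if __name__ == "__main__":
    print(f"download: {to_gb(DOWNLOAD_BYTES):.1f} GB ({to_gib(DOWNLOAD_BYTES):.1f} GiB)")
    print(f"dataset:  {to_gb(DATASET_BYTES):.1f} GB ({to_gib(DATASET_BYTES):.1f} GiB)")
    print(f"expansion ratio: {expansion_ratio:.2f}x")
```

In round numbers, that is a download of about 39.5 GB against a dataset size of about 108.4 GB.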

Data instance

{title: 07 LEGO Ninjago - The Search For Zane (Scholastic) - Kate Howard (retail), text:

TITLE PAGE

FROM THE JOURNAL OF SENSEI GARMADON

CHAPTER 1

CHAPTER 2

CHAPTER 3

CHAPTER 4

CHAPTER 5

CHAPTER 6

CHAPTER 7

CHAPTER 8

CHAPTER 9

COPYRIGHT

Throughout Ninjago", five ninja are well-known for their speed, strength, and  of course  the elemental powers that help them protect our world from evil. But there are others who possess some of the same powers as the ninja. Others who may not always use their powers for good.

Before now, the ninja believed they were special. They di.......}

Data fields

  • title: the book's title
  • text: the book's text content
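
A record with these two fields can be modeled as a simple typed dictionary. The sketch below is illustrative only: the `Book` type and the `is_valid_book` helper are hypothetical names, not part of any library, and the sample values are abbreviated from the data instance shown earlier on this page.

```python
from typing import Any, TypedDict


class Book(TypedDict):
    """One record: both fields are plain strings."""
    title: str
    text: str


def is_valid_book(record: Any) -> bool:
    """Return True if record is a dict with exactly the string fields title and text."""
    return (
        isinstance(record, dict)
        and set(record) == {"title", "text"}
        and all(isinstance(record[key], str) for key in ("title", "text"))
    )


# Sample values abbreviated from the data instance on this page.
sample: Book = {
    "title": "07 LEGO Ninjago - The Search For Zane (Scholastic) - Kate Howard (retail)",
    "text": "TITLE PAGE\n\nFROM THE JOURNAL OF SENSEI GARMADON\n\nCHAPTER 1",
}

if __name__ == "__main__":
    print(is_valid_book(sample))            # True
    print(is_valid_book({"title": "x"}))    # False: the text field is missing
```

A check like this is cheap to run over records before they are passed to a tokenizer or training loop.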

Data splits

Split    Samples
train    196640

License information

MIT

Citation information

@article{pile,
  title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
  author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
  journal={arXiv preprint arXiv:2101.00027},
  year={2020}
}

The content above was collected and summarized by 遇见数据集.