sedthh/gutenberg_english
Source: Hugging Face. Updated 2023-03-17. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/sedthh/gutenberg_english
Resource description:
---
dataset_info:
  features:
  - name: TEXT
    dtype: string
  - name: SOURCE
    dtype: string
  - name: METADATA
    dtype: string
  splits:
  - name: train
    num_bytes: 18104255935
    num_examples: 48284
  download_size: 10748877194
  dataset_size: 18104255935
license: mit
task_categories:
- text-generation
language:
- en
tags:
- project gutenberg
- e-book
- gutenberg.org
pretty_name: Project Gutenberg eBooks in English
size_categories:
- 10K<n<100K
---
# Dataset Card for Project Gutenberg - English Language eBooks
A collection of English language eBooks (48284 rows, over 80% of all English language books available on the site) from the Project Gutenberg site, with metadata removed.
Originally collected for https://github.com/LAION-AI/Open-Assistant (follows the OpenAssistant training format).
The METADATA column contains catalogue meta information on each book as a serialized JSON:
| key | original column |
|----|----|
| language | - |
| text_id | Text# unique book identifier on Project Gutenberg as *int* |
| title | Title of the book as *string* |
| issued | Issued date as *string* |
| authors | Authors as *string*, comma-separated, sometimes with dates |
| subjects | Subjects as *string*, various formats |
| locc | LoCC code as *string* |
| bookshelves | Bookshelves as *string*, optional |
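Because METADATA is stored as a serialized JSON string, it has to be decoded before the keys in the table can be used. The following is a minimal sketch, not code from the dataset authors: the helper name `parse_metadata` and every value in `example_row` are made up for illustration, and only the key names come from the table above.

```python
import json


def parse_metadata(raw: str) -> dict:
    """Decode one METADATA cell (a JSON string) into a dict."""
    return json.loads(raw)


# Hypothetical row: the column names match the dataset, the values are placeholders.
example_row = {
    "TEXT": "Body of the book ...",
    "SOURCE": "placeholder source string",
    "METADATA": json.dumps(
        {
            "language": "en",
            "text_id": 12345,
            "title": "Example Title",
            "issued": "2000-01-01",
            "authors": "Doe, Jane, 1800-1850",
            "subjects": "Example subject",
            "locc": "PR",
        }
    ),
}

meta = parse_metadata(example_row["METADATA"])

# text_id is the only key documented as an int; the rest are strings.
print(meta["text_id"], meta["title"])

# bookshelves is documented as optional, so read it defensively.
print(meta.get("bookshelves", "(no bookshelf)"))
```

Note that `authors` is left as a single string here: names themselves contain commas and may carry dates, so a plain split on commas would not reliably separate individual authors.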
## Source data
**How was the data generated?**
- A crawler (see Open-Assistant repository) downloaded the raw HTML code for
each eBook based on **Text#** id in the Gutenberg catalogue (if available)
- The metadata and the body of text are not clearly separated, so an additional
parser attempts to split them, then removes transcriber's notes and e-book
related information from the body of text (text clearly marked as copyrighted or
malformed was skipped and not collected)
- The cleaned body of TEXT and the catalogue METADATA are then saved as
a parquet file, with all columns being strings
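The splitting step described above can be illustrated with a rough sketch. This is not the parser from the Open-Assistant repository: it assumes the book has already been reduced to plain text and that it carries the `*** START OF ... ***` and `*** END OF ... ***` marker lines commonly found in Project Gutenberg files. The function name `split_body` and both patterns are hypothetical.

```python
import re
from typing import Optional

# Assumed marker lines; real files vary in wording and spacing.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)


def split_body(raw: str) -> Optional[str]:
    """Return the text between the start and end markers, or None if malformed."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if start is None or end is None or end.start() < start.end():
        # Treat files without a usable marker pair as malformed and skip them.
        return None
    return raw[start.end():end.start()].strip()


sample = (
    "Title: Example Title\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE TITLE ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE TITLE ***\n"
    "License text follows here.\n"
)
print(split_body(sample))
```

Returning `None` for files without both markers mirrors the "malformed was skipped" behaviour described in the list, although the actual crawler may use different criteria.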
**Copyright notice:**
- Some of the books are copyrighted! The crawler ignored all books
with an English copyright header by using a regular expression, but make
sure to check the metadata for each book manually to ensure it is okay
to use in your country! More information on copyright:
https://www.gutenberg.org/help/copyright.html and
https://www.gutenberg.org/policy/permission.html
- Project Gutenberg has the following request when using books without
metadata: _Books obtained from the Project Gutenberg site should have the
following legal note next to them: "This eBook is for the use of anyone
anywhere in the United States and most other parts of the world at no cost and
with almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country where you are located
before using this eBook."_
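The regular-expression filter mentioned in the copyright notice might look roughly like the sketch below. The actual pattern used by the crawler is not given here, so `COPYRIGHT_RE`, `looks_copyrighted`, and the `max_lines` cutoff are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: an explicit "copyrighted eBook" banner, or a line that
# starts with "Copyright" followed by an optional (C) mark and a 4-digit year.
COPYRIGHT_RE = re.compile(
    r"^\W*this is a copyrighted project gutenberg ebook"
    r"|^\s*copyright\s*(?:\(c\)|©)?\s*\d{4}",
    re.IGNORECASE | re.MULTILINE,
)


def looks_copyrighted(header: str, max_lines: int = 40) -> bool:
    """Check only the first max_lines lines, where a header notice would sit."""
    head = "\n".join(header.splitlines()[:max_lines])
    return COPYRIGHT_RE.search(head) is not None


print(looks_copyrighted("Title: Example\nCopyright (C) 2005 by Someone"))
print(looks_copyrighted("Title: Example\nAuthor: Jane Doe"))
```

A header-only heuristic like this can miss notices placed elsewhere in a file, which is one reason the manual check recommended above still matters.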
Provider:
sedthh
Summary of the original information
Dataset overview
Basic information
- Name: Project Gutenberg eBooks in English
- Language: English (en)
- Task category: text generation (text-generation)
- License: MIT
- Tags:
  - Project Gutenberg
  - e-book
  - gutenberg.org
Dataset structure
- Features:
  - TEXT: text content, string
  - SOURCE: source information, string
  - METADATA: metadata, string
- Splits:
  - train: 48284 examples, 18104255935 bytes in total
- Download size: 10748877194 bytes
- Dataset size: 18104255935 bytes
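Metadata details
- The METADATA column contains:
  - language: language
  - text_id: unique book identifier, integer
  - title: book title, string
  - issued: issue date, string
  - authors: authors, string, comma-separated, sometimes with dates
  - subjects: subjects, string, various formats
  - locc: Library of Congress Classification code, string
  - bookshelves: bookshelf category, string, optional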
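Data generation
- A crawler downloaded the raw HTML code from the Gutenberg catalogue.
- An additional parser attempts to separate the metadata from the body of text and removes transcriber's notes and e-book related information.
- The cleaned body of text and the catalogue metadata are saved as a parquet file, with all columns being strings.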
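Copyright notes
- Some of the books are copyrighted; the crawler used a regular expression to ignore all books with an English copyright header.
- Before use, check the metadata of each book manually to make sure it complies with the laws of the country where it is used.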



