sedthh/gutenberg_english

Name: sedthh/gutenberg_english
Creator: sedthh
Published: 2023-03-17 09:50:22
License: no description provided

Hugging Face · Updated 2023-03-17 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/sedthh/gutenberg_english

Resource description:

---
dataset_info:
  features:
  - name: TEXT
    dtype: string
  - name: SOURCE
    dtype: string
  - name: METADATA
    dtype: string
  splits:
  - name: train
    num_bytes: 18104255935
    num_examples: 48284
  download_size: 10748877194
  dataset_size: 18104255935
license: mit
task_categories:
- text-generation
language:
- en
tags:
- project gutenberg
- e-book
- gutenberg.org
pretty_name: Project Gutenberg eBooks in English
size_categories:
- 10K<n<100K
---

# Dataset Card for Project Gutenberg - English Language eBooks

A collection of English language eBooks (48284 rows, 80%+ of all English language books available on the site) from the Project Gutenberg site with metadata removed.

Originally collected for https://github.com/LAION-AI/Open-Assistant (follows the OpenAssistant training format).

The METADATA column contains catalogue meta information on each book as a serialized JSON:

| key | original column |
|----|----|
| language | - |
| text_id | Text# unique book identifier on Project Gutenberg as *int* |
| title | Title of the book as *string* |
| issued | Issued date as *string* |
| authors | Authors as *string*, comma separated, sometimes with dates |
| subjects | Subjects as *string*, various formats |
| locc | LoCC code as *string* |
| bookshelves | Bookshelves as *string*, optional |

## Source data

**How was the data generated?**
- A crawler (see the Open-Assistant repository) downloaded the raw HTML code for each eBook based on its **Text#** id in the Gutenberg catalogue (if available).
- The metadata and the body of text are not clearly separated, so an additional parser attempts to split them and then removes transcriber's notes and e-book related information from the body of text (text clearly marked as copyrighted or malformed was skipped and not collected).
- The body of cleaned TEXT as well as the catalogue METADATA is then saved as a parquet file, with all columns being strings.

**Copyright notice:**
- Some of the books are copyrighted! The crawler ignored all books with an English copyright header by using a regular expression, but make sure to check the metadata for each book manually to ensure they are okay to use in your country! More information on copyright: https://www.gutenberg.org/help/copyright.html and https://www.gutenberg.org/policy/permission.html
- Project Gutenberg has the following requests when using books without metadata: _Books obtained from the Project Gutenberg site should have the following legal note next to them: "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook."_
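
A minimal sketch of how the three string columns might be read, assuming the third-party Hugging Face `datasets` package is installed and the repository id `sedthh/gutenberg_english` resolves on the Hub. The helper names `stream_books` and `describe_row` are made up for illustration. Streaming is used so the roughly 10 GB download is not fetched up front.

```python
from itertools import islice


def stream_books(limit=3):
    """Yield the first `limit` rows of the train split without a full download.

    Assumption: the `datasets` package is installed (pip install datasets).
    It is imported lazily so the rest of this sketch works without it.
    """
    from datasets import load_dataset

    ds = load_dataset("sedthh/gutenberg_english", split="train", streaming=True)
    return islice(ds, limit)


def describe_row(row):
    """Summarize one row; every column is expected to be a string."""
    return {
        "source": row["SOURCE"],
        "chars": len(row["TEXT"]),
        "metadata_is_str": isinstance(row["METADATA"], str),
    }


# Usage (needs network access):
#     for row in stream_books():
#         print(describe_row(row))
```

`describe_row` only touches the three documented columns, so it can be tried on a hand-written dictionary before any data is downloaded.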

Provided by:

sedthh

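Summary of original information

Dataset overview

Basic information

Name: Project Gutenberg eBooks in English
Language: English (en)
Task category: text generation (text-generation)
License: MIT
Tags:
- Project Gutenberg
- e-book
- gutenberg.org

Dataset structure

Features:
- TEXT: text content, string
- SOURCE: source information, string
- METADATA: metadata, string
Splits:
- train: 48284 examples, 18104255935 bytes in total
Download size: 10748877194 bytes
Dataset size: 18104255935 bytes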
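
The byte counts above are easier to judge in larger units. The following short calculation, a sketch that only reuses the three figures listed here, converts them to GiB and derives the average size per book and the download-to-dataset ratio.

```python
# Figures copied from the split description above.
NUM_BYTES = 18_104_255_935       # dataset size in bytes
DOWNLOAD_BYTES = 10_748_877_194  # download size in bytes
NUM_EXAMPLES = 48_284            # rows in the train split

GIB = 2**30  # bytes per gibibyte

dataset_gib = NUM_BYTES / GIB
download_gib = DOWNLOAD_BYTES / GIB
avg_bytes_per_book = NUM_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / NUM_BYTES

print(f"dataset size:   {dataset_gib:.2f} GiB")
print(f"download size:  {download_gib:.2f} GiB")
print(f"average book:   {avg_bytes_per_book / 1000:.0f} kB")
print(f"download ratio: {download_ratio:.2f}")
```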

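Metadata details

Fields in the METADATA column:
- language: language
- text_id: unique book identifier, integer
- title: book title, string
- issued: issue date, string
- authors: authors, string, comma separated, sometimes with dates
- subjects: subjects, string, various formats
- locc: Library of Congress Classification code, string
- bookshelves: bookshelves, string, optional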
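
Since METADATA is stored as a serialized JSON string, it has to be decoded before the fields listed above can be used. The sketch below shows one way to do that with the standard library. The `BookMetadata` class, the `parse_metadata` helper and every value in the sample record are invented for illustration and are not taken from the dataset. Falling back to an empty string for missing keys is an assumption, not documented behaviour.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class BookMetadata:
    language: str
    text_id: int
    title: str
    issued: str
    authors: str
    subjects: str
    locc: str
    bookshelves: Optional[str] = None  # documented as optional


def parse_metadata(raw: str) -> BookMetadata:
    """Decode one METADATA string into a typed record."""
    data = json.loads(raw)
    return BookMetadata(
        language=data.get("language", ""),
        text_id=int(data["text_id"]),
        title=data.get("title", ""),
        issued=data.get("issued", ""),
        authors=data.get("authors", ""),
        subjects=data.get("subjects", ""),
        locc=data.get("locc", ""),
        bookshelves=data.get("bookshelves"),
    )


# Made-up record with the documented keys; "bookshelves" is left out on purpose.
sample = json.dumps({
    "language": "en",
    "text_id": 12345,
    "title": "An Example Book",
    "issued": "2001-01-01",
    "authors": "Doe, Jane, 1800-1870",
    "subjects": "Fiction",
    "locc": "PR",
})

book = parse_metadata(sample)
print(book.text_id, book.title, book.bookshelves)
```

Keeping `authors` and `subjects` as plain strings is deliberate: their formats vary, so any further splitting would need to be checked against real rows.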

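How the data was generated

A crawler downloaded the raw HTML code for each eBook from the Gutenberg catalogue.
An additional parser attempts to separate the metadata from the body of text and removes transcriber's notes and e-book related information.
The cleaned body of text and the catalogue metadata are saved as a parquet file, with all columns being strings.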
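
The actual parser lives in the Open-Assistant repository and is not reproduced here. As a rough illustration of the splitting step only, the sketch below separates header and body using the `*** START OF ... ***` and `*** END OF ... ***` marker lines found in Project Gutenberg plain-text files. Working on plain text instead of HTML, the exact patterns and the `split_book` helper are all assumptions. Input without both markers is treated as malformed and skipped.

```python
import re

# Marker lines as they commonly appear in Project Gutenberg plain-text files.
START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def split_book(raw: str):
    """Return (header, body), or None when the markers are missing or out of order."""
    start = START.search(raw)
    end = END.search(raw)
    if start is None or end is None or end.start() < start.end():
        return None  # malformed: skip instead of guessing
    header = raw[: start.start()].strip()
    body = raw[start.end() : end.start()].strip()
    return header, body


# Made-up miniature file for illustration.
sample = (
    "Title: Example\nAuthor: Jane Doe\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n\n"
    "Chapter 1\n\nText.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text"
)

print(split_book(sample))
```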

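Copyright notes

Some of the books are copyrighted; the crawler used a regular expression to skip all books with an English copyright header.
Before use, check the metadata of each book manually to make sure it is okay to use in your country.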
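
The crawler's actual regular expression is not given on this page. The sketch below only illustrates the general idea of a header check: the pattern, the 2000-character header window and the `looks_copyrighted` helper are hypothetical, and a check like this does not replace the manual review described above.

```python
import re

# Hypothetical pattern: "copyright" followed by "(c)", "©" or a four-digit year.
COPYRIGHT_HEADER = re.compile(r"copyright\s*(?:\(c\)|©|\d{4})", re.IGNORECASE)


def looks_copyrighted(text: str, header_chars: int = 2000) -> bool:
    """Return True if the start of the text contains a copyright-style header."""
    return COPYRIGHT_HEADER.search(text[:header_chars]) is not None


# Made-up inputs for illustration.
print(looks_copyrighted("Copyright (C) 2004 by Jane Doe\n\nChapter 1"))
print(looks_copyrighted("Chapter 1\n\nIt was a quiet morning."))
```

Only the beginning of the text is scanned, so a novel that merely mentions the word later in its body is not flagged.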
