zkeown/gutenberg-corpus
Source: Hugging Face. Updated 2026-03-06; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/zkeown/gutenberg-corpus
Description:
---
configs:
- config_name: books
  data_files:
  - split: train
    path: books/*.parquet
- config_name: chapters
  data_files:
  - split: train
    path: chapters/*.parquet
- config_name: paragraphs
  data_files:
  - split: train
    path: paragraphs/*.parquet
license: apache-2.0
task_categories:
- text-generation
language:
- en
- de
- fr
- es
- it
- pt
- nl
- fi
- zh
tags:
- gutenberg
- literature
- public-domain
- books
pretty_name: Project Gutenberg Corpus
size_categories:
- 10K<n<100K
---
# Project Gutenberg Corpus
A comprehensive dataset of **74,007 public domain books** from [Project Gutenberg](https://www.gutenberg.org/), with rich structured metadata, chapter detection, and paragraph-level chunking.
## Configs
| Config | Rows | Description |
|--------|------|-------------|
| `books` | 74,007 | Full book text + 16 metadata columns |
| `chapters` | 650,408 | Chapter-level chunks |
| `paragraphs` | 91,853,326 | Paragraph-level chunks (ideal for RAG) |
## Usage
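A minimal sketch of loading one config with the Hugging Face `datasets` library, assuming it is installed (`pip install datasets`). The repository ID and config names come from this card; the helper `load_gutenberg` and the constants `REPO_ID` and `CONFIGS` are illustrative names, not part of the dataset.

```python
REPO_ID = "zkeown/gutenberg-corpus"
CONFIGS = ("books", "chapters", "paragraphs")


def load_gutenberg(config: str = "books", streaming: bool = True):
    """Load the train split of one config from the Hugging Face Hub.

    `load_gutenberg` is a hypothetical convenience wrapper, not an
    official API of the dataset.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")

    # Imported lazily so the module can be imported without the dependency.
    from datasets import load_dataset

    # streaming=True iterates over the Parquet shards on demand instead of
    # downloading the whole config first, which matters for `paragraphs`.
    return load_dataset(REPO_ID, config, split="train", streaming=streaming)
```

With streaming enabled, a first record can be read with `next(iter(load_gutenberg("books")))`. Passing `streaming=False` downloads the full config, which may be large for `paragraphs`.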
## Metadata Fields (books config)
id, title, author, author_birth, author_death, subjects, bookshelves, loc_class, language, rights, contributors, summary, release_date, has_chapters, chapter_count, text
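The fields above can be used to filter records before further processing. The sketch below is only illustrative: it assumes `language` is a single string code such as `"en"`, `has_chapters` is a boolean, and `chapter_count` is an integer, none of which this card states. The function name `select_books` is hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator


def select_books(
    rows: Iterable[Dict[str, Any]],
    language: str = "en",
    min_chapters: int = 1,
) -> Iterator[Dict[str, Any]]:
    """Yield rows in the given language with at least `min_chapters` chapters.

    Assumed field types (not confirmed by the card): `language` is a
    string code, `has_chapters` a bool, `chapter_count` an int.
    """
    for row in rows:
        if row.get("language") != language:
            continue
        if not row.get("has_chapters"):
            continue
        # Treat a missing or null chapter_count as zero.
        if (row.get("chapter_count") or 0) < min_chapters:
            continue
        yield row
```

Because it accepts any iterable of dictionaries, the same function works on a streamed `books` split or on a plain list of records.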
## Pipeline
Built with [gutenberg-hf-dataset](https://github.com/zakkeown/gutenberg-hf-dataset). Updated weekly via GitHub Actions.
## License
- **Code**: Apache 2.0
- **Texts**: Public domain (Project Gutenberg headers/footers stripped)
Provider:
zkeown



