zkeown/gutenberg-corpus

Name: zkeown/gutenberg-corpus
Creator: zkeown
Published: 2026-03-06 05:31:13
License: No description provided

Hugging Face | Updated 2026-03-06 | Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/zkeown/gutenberg-corpus

Resource description:

---
configs:
- config_name: books
  data_files:
  - split: train
    path: books/*.parquet
- config_name: chapters
  data_files:
  - split: train
    path: chapters/*.parquet
- config_name: paragraphs
  data_files:
  - split: train
    path: paragraphs/*.parquet
license: apache-2.0
task_categories:
- text-generation
language:
- en
- de
- fr
- es
- it
- pt
- nl
- fi
- zh
tags:
- gutenberg
- literature
- public-domain
- books
pretty_name: Project Gutenberg Corpus
size_categories:
- 10K<n<100K
---

# Project Gutenberg Corpus

A comprehensive dataset of **74,007 public domain books** from [Project Gutenberg](https://www.gutenberg.org/), with rich structured metadata, chapter detection, and paragraph-level chunking.

## Configs

| Config | Rows | Description |
|--------|------|-------------|
| `books` | 74,007 | Full book text + 16 metadata columns |
| `chapters` | 650,408 | Chapter-level chunks |
| `paragraphs` | 91,853,326 | Paragraph-level chunks (ideal for RAG) |

## Usage

Each config is loaded by name with the Hugging Face `datasets` library and exposes a single `train` split.

## Metadata Fields (books config)

`id`, `title`, `author`, `author_birth`, `author_death`, `subjects`, `bookshelves`, `loc_class`, `language`, `rights`, `contributors`, `summary`, `release_date`, `has_chapters`, `chapter_count`, `text`

## Pipeline

Built with [gutenberg-hf-dataset](https://github.com/zakkeown/gutenberg-hf-dataset). Updated weekly via GitHub Actions.

## License

- **Code**: Apache 2.0
- **Texts**: Public domain (Project Gutenberg headers/footers stripped)
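The usage example from the original card did not survive extraction, so the following is only a minimal sketch of how the three configs described above might be loaded. It assumes the Hugging Face `datasets` library is installed. The helper `load_config` and the constants `REPO_ID` and `CONFIG_ROWS` are hypothetical names introduced for illustration and are not part of the dataset. Streaming is used by default here because the `paragraphs` config has over 91 million rows.

```python
# Config names and row counts as listed in the dataset card.
CONFIG_ROWS = {
    "books": 74_007,
    "chapters": 650_408,
    "paragraphs": 91_853_326,
}

# Repository id taken from the listing; the constant name is an assumption.
REPO_ID = "zkeown/gutenberg-corpus"


def load_config(name: str, streaming: bool = True):
    """Load the train split of one config (hypothetical helper)."""
    if name not in CONFIG_ROWS:
        raise ValueError(
            f"unknown config {name!r}; expected one of {sorted(CONFIG_ROWS)}"
        )
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # With streaming=True, rows are read on demand instead of downloading
    # every parquet shard first.
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)
```

With this sketch, `load_config("books")` would return an iterable of rows, and `next(iter(...))` on the result should yield a dict keyed by the metadata fields listed in the card (for example `title` and `author`). Passing `streaming=False` downloads the full config instead.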

Provided by:

zkeown
