zkeown/gutenberg-corpus

Source: Hugging Face. Updated 2026-03-06. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/zkeown/gutenberg-corpus
Description:
---
configs:
- config_name: books
  data_files:
  - split: train
    path: books/*.parquet
- config_name: chapters
  data_files:
  - split: train
    path: chapters/*.parquet
- config_name: paragraphs
  data_files:
  - split: train
    path: paragraphs/*.parquet
license: apache-2.0
task_categories:
- text-generation
language:
- en
- de
- fr
- es
- it
- pt
- nl
- fi
- zh
tags:
- gutenberg
- literature
- public-domain
- books
pretty_name: Project Gutenberg Corpus
size_categories:
- 10K<n<100K
---

# Project Gutenberg Corpus

A comprehensive dataset of **74,007 public domain books** from [Project Gutenberg](https://www.gutenberg.org/), with rich structured metadata, chapter detection, and paragraph-level chunking.

## Configs

| Config | Rows | Description |
|--------|------|-------------|
| `books` | 74,007 | Full book text + 16 metadata columns |
| `chapters` | 650,408 | Chapter-level chunks |
| `paragraphs` | 91,853,326 | Paragraph-level chunks (ideal for RAG) |

## Usage

## Metadata Fields (books config)

`id`, `title`, `author`, `author_birth`, `author_death`, `subjects`, `bookshelves`, `loc_class`, `language`, `rights`, `contributors`, `summary`, `release_date`, `has_chapters`, `chapter_count`, `text`

## Pipeline

Built with [gutenberg-hf-dataset](https://github.com/zakkeown/gutenberg-hf-dataset). Updated weekly via GitHub Actions.

## License

- **Code**: Apache 2.0
- **Texts**: Public domain (Project Gutenberg headers/footers stripped)
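
The three configs described above could be loaded roughly as follows. This is a minimal sketch, not the dataset author's own snippet: it assumes the Hugging Face `datasets` library is installed and that the repository ID matches the page title (`zkeown/gutenberg-corpus`). The helper names `load_config` and `preview_first_book` are hypothetical and exist only for illustration. Streaming is used by default because the `paragraphs` config has over 91 million rows.

```python
# Sketch only: REPO_ID is assumed from the page title, and the helper
# names below are illustrative, not part of the dataset or any library.
REPO_ID = "zkeown/gutenberg-corpus"
CONFIGS = ("books", "chapters", "paragraphs")  # config names from the card's front matter


def load_config(name, streaming=True):
    """Load the train split of one config, streaming by default."""
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split="train", streaming=streaming)


def preview_first_book():
    """Return (title, author) of the first record in the books config."""
    books = load_config("books")
    first = next(iter(books))  # a streamed record is a plain dict
    # "title" and "author" are columns listed under Metadata Fields.
    return first["title"], first["author"]
```

Calling `preview_first_book()` needs network access to the Hub. Passing `streaming=False` to `load_config` would download the whole config first, which is likely only practical for `books` or `chapters`.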
Provider:
zkeown