enelpol/gutenberg_selected_ebooks
Source: Hugging Face. Updated 2024-10-30; indexed 2025-04-12.
Download link:
https://hf-mirror.com/datasets/enelpol/gutenberg_selected_ebooks
Resource description:
---
dataset_info:
  config_name: corpus
  features:
  - name: passage
    dtype: string
  - name: id
    dtype: int64
  - name: author
    dtype: string
  - name: title
    dtype: string
  - name: gutenberg_source_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 2633492
    num_examples: 1251
  download_size: 1578596
  dataset_size: 2633492
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
license: mit
task_categories:
- question-answering
language:
- en
tags:
- ebook
- project_gutenberg
size_categories:
- 1K<n<10K
---
# Gutenberg selected ebooks dataset
This dataset is a collection of passages from ebooks handpicked from [Project Gutenberg](https://www.gutenberg.org/).
The selected works are:
* Alice's Adventures in Wonderland
* Pride and Prejudice
* Romeo and Juliet
* The Adventures of Sherlock Holmes
* The Odyssey
* Winnie-the-Pooh
# Source
The texts of the passages were derived from a larger Gutenberg-based set: [sedthh/gutenberg_english](https://huggingface.co/datasets/sedthh/gutenberg_english), which was sourced directly from the project's site.
# Metadata
Each passage carries four metadata fields alongside the `passage` text itself:
| key | description |
|----|----|
| id | Passage unique identifier as *int* |
| title | Title of the book as *string* |
| author | Author's name as *string* |
| gutenberg_source_id | Text# unique book identifier on Project Gutenberg as *int* |
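
The fields above can be checked and summarised with a few lines of Python. This is a minimal sketch, not an official loader: it assumes the `datasets` package is installed and the Hub is reachable, and the helper names `load_corpus`, `schema_problems`, and `passages_per_title` are hypothetical, introduced only for illustration. The field names and types are taken from the feature list in this card.

```python
from collections import Counter

# Field names and Python types, taken from the feature list in the dataset card.
EXPECTED_FIELDS = {
    "passage": str,
    "id": int,
    "author": str,
    "title": str,
    "gutenberg_source_id": int,
}


def load_corpus():
    """Load the train split of the `corpus` config.

    Assumes the `datasets` package is installed and the Hub is reachable.
    """
    # Imported lazily so the helpers below work without extra packages.
    from datasets import load_dataset

    return load_dataset("enelpol/gutenberg_selected_ebooks", "corpus", split="train")


def schema_problems(row):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


def passages_per_title(rows):
    """Count how many passages each book contributes."""
    return Counter(row["title"] for row in rows)
```

Calling `passages_per_title(load_corpus())` should give one count per book, and running `schema_problems` over each row is a quick way to confirm that a local copy still matches the documented fields.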
# Copyrights
A note from the source dataset that applies to this data as well:
- Some of the books are copyrighted! The crawler ignored all books
with an English copyright header by using a regular expression, but make
sure to check the metadata for each book manually to ensure it is okay
to use in your country! More information on copyright:
https://www.gutenberg.org/help/copyright.html and
https://www.gutenberg.org/policy/permission.html
- Project Gutenberg has the following request when using books without
metadata: *Books obtained from the Project Gutenberg site should have the
following legal note next to them: "This eBook is for the use of anyone
anywhere in the United States and most other parts of the world at no cost and
with almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.org. If you are not located in the United
States, you will have to check the laws of the country where you are located
before using this eBook."*
Provider:
enelpol



