
enelpol/gutenberg_selected_ebooks

Hugging Face | Updated: 2024-10-30 | Indexed: 2025-04-12
Download link:
https://hf-mirror.com/datasets/enelpol/gutenberg_selected_ebooks
Resource description:
---
dataset_info:
  config_name: corpus
  features:
  - name: passage
    dtype: string
  - name: id
    dtype: int64
  - name: author
    dtype: string
  - name: title
    dtype: string
  - name: gutenberg_source_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 2633492
    num_examples: 1251
  download_size: 1578596
  dataset_size: 2633492
configs:
- config_name: corpus
  data_files:
  - split: train
    path: corpus/train-*
license: mit
task_categories:
- question-answering
language:
- en
tags:
- ebook
- project_gutenberg
size_categories:
- 1K<n<10K
---

# Gutenberg selected ebooks dataset

This dataset is a collection of passages from ebooks handpicked from [Project Gutenberg](https://www.gutenberg.org/). The works are:

* Alice's Adventures in Wonderland
* Pride and Prejudice
* Romeo and Juliet
* The Adventures of Sherlock Holmes
* The Odyssey
* Winnie-the-Pooh

# Source

The passage texts were derived from a larger Gutenberg-based dataset, [sedthh/gutenberg_english](https://huggingface.co/datasets/sedthh/gutenberg_english), which was sourced directly from the project's site.

# Metadata

Each passage contains four metadata fields:

| key | description |
|----|----|
| id | Unique passage identifier as *int* |
| title | Title of the book as *string* |
| author | Author's name as *string* |
| gutenberg_source_id | Text#, the unique book identifier on Project Gutenberg, as *int* |

# Copyrights

A note from the source dataset, applicable to this data as well:

- Some of the books are copyrighted! The crawler ignored all books with an English copyright header by using a regular expression, but make sure to check the metadata for each book manually to ensure it is okay to use in your country! More information on copyright: https://www.gutenberg.org/help/copyright.html and https://www.gutenberg.org/policy/permission.html
- Project Gutenberg has the following requests when using books without metadata: *Books obtained from the Project Gutenberg site should have the following legal note next to them: "This eBook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook."*
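As an illustration of the schema described in the card, the following minimal sketch loads the `corpus` config with the Hugging Face `datasets` library and tallies passages per book. The helper names `load_corpus` and `count_passages_by_title` and the `SAMPLE_RECORDS` values are hypothetical and introduced only for this example; only the repository id, config name, split name, and field names come from the card.

```python
from collections import Counter
from typing import Iterable, Mapping

# Field names taken from the dataset card's `features` list.
REQUIRED_FIELDS = ("passage", "id", "author", "title", "gutenberg_source_id")


def load_corpus():
    """Load the `train` split of the `corpus` config (needs network access).

    The import is done lazily so the rest of this sketch runs without
    the `datasets` package installed.
    """
    from datasets import load_dataset

    return load_dataset(
        "enelpol/gutenberg_selected_ebooks", "corpus", split="train"
    )


def count_passages_by_title(records: Iterable[Mapping]) -> Counter:
    """Count passages per book title, checking each record has all fields."""
    counts: Counter = Counter()
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise KeyError(f"record is missing fields: {missing}")
        counts[record["title"]] += 1
    return counts


# Hypothetical records that only mimic the schema; the passage text,
# author strings, and ids are placeholders, not real dataset rows.
SAMPLE_RECORDS = [
    {"passage": "(passage text)", "id": 0, "author": "Lewis Carroll",
     "title": "Alice's Adventures in Wonderland", "gutenberg_source_id": 11},
    {"passage": "(passage text)", "id": 1, "author": "Lewis Carroll",
     "title": "Alice's Adventures in Wonderland", "gutenberg_source_id": 11},
    {"passage": "(passage text)", "id": 2, "author": "Jane Austen",
     "title": "Pride and Prejudice", "gutenberg_source_id": 1342},
]

sample_counts = count_passages_by_title(SAMPLE_RECORDS)
```

With `datasets` installed and network access available, `count_passages_by_title(load_corpus())` should produce the same kind of tally over the real train split, since iterating a loaded split yields one dictionary per passage.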
Provider:
enelpol