jukofyork/gutenberg-fiction-paragraphs
Source: Hugging Face. Updated: 2025-09-11. Indexed: 2025-10-25.
Download link:
https://hf-mirror.com/datasets/jukofyork/gutenberg-fiction-paragraphs
Description:
This dataset was created from around 15k fiction books downloaded from Project Gutenberg. The books were extensively cleaned: mid-paragraph line breaks were fixed; all front matter, end matter, and other boilerplate was removed; very small files that were obviously not books were excluded; and many other small fixes were applied. The resulting 14.3k cleaned books were then split by paragraph, and only paragraphs between 75 and 2000 characters long were retained in the text field. The dataset was then de-duplicated and re-shuffled.
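The split, filter, de-duplicate, and shuffle steps described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the function names, the blank-line paragraph separator, the inclusive length bounds, exact-match de-duplication, and the fixed shuffle seed are all assumptions made for the example.

```python
import random

# Length bounds taken from the description; treating them as inclusive is an assumption.
MIN_CHARS = 75
MAX_CHARS = 2000


def split_paragraphs(book_text):
    """Split cleaned book text on blank lines (assumed paragraph separator)."""
    return [p.strip() for p in book_text.split("\n\n") if p.strip()]


def build_paragraph_records(books, seed=0):
    """Hypothetical helper: filter paragraphs by length, de-duplicate, shuffle."""
    seen = set()
    records = []
    for book_text in books:
        for para in split_paragraphs(book_text):
            # Keep only paragraphs within the stated character range.
            if not (MIN_CHARS <= len(para) <= MAX_CHARS):
                continue
            # Exact-match de-duplication (an assumption; the source does not
            # say how duplicates were detected).
            if para in seen:
                continue
            seen.add(para)
            records.append({"text": para})
    # Shuffle with a fixed seed so the sketch is reproducible.
    random.Random(seed).shuffle(records)
    return records
```

Calling build_paragraph_records on a list of cleaned book strings returns a shuffled list of dictionaries, each with a single text key, mirroring the text field mentioned in the description.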
Provider:
jukofyork



