eminorhan/gutenberg_en
Source: Hugging Face. Updated 2023-11-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/eminorhan/gutenberg_en
Resource overview:
---
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 10M<n<100M
configs:
- config_name: chunk_size_1024
  data_files: "gutenberg_en_paragraph_1024.jsonl"
- config_name: chunk_size_2048
  data_files: "gutenberg_en_paragraph_2048.jsonl"
---
**Description of the dataset**
This is the November 16, 2023 snapshot of the English subset of the Project Gutenberg corpus (containing 56712 documents in total), downloaded and preprocessed with code from [this repository](https://github.com/eminorhan/gutenberg).
Two different versions of the data are provided:
* The `chunk_size_1024` version divides the data into ~14.2M records. Each record holds a chunk of text a few paragraphs long (at least 1024 characters) together with the corresponding metadata.
* The `chunk_size_2048` version divides the data into ~8.2M records. Each record holds a chunk of text a few paragraphs long (at least 2048 characters) together with the corresponding metadata.
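
The chunking described above can be illustrated with a minimal sketch. It assumes paragraphs are separated by blank lines and are merged greedily until the minimum length is reached; the function name `chunk_paragraphs` and the handling of a short trailing remainder are hypothetical, and the actual preprocessing in the linked repository may differ.

```python
def chunk_paragraphs(text, min_chars=1024):
    """Greedily merge consecutive paragraphs into chunks of at least min_chars.

    Illustrative only: the blank-line paragraph split and the remainder
    handling are assumptions, not the dataset author's confirmed method.
    """
    # Split on blank lines and drop empty paragraphs.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current = ""
    for paragraph in paragraphs:
        # Append the paragraph to the chunk being built.
        current = f"{current}\n\n{paragraph}" if current else paragraph
        # Close the chunk once it reaches the minimum length.
        if len(current) >= min_chars:
            chunks.append(current)
            current = ""

    if current:
        if chunks:
            # Fold a short trailing remainder into the last chunk (assumption).
            chunks[-1] = f"{chunks[-1]}\n\n{current}"
        else:
            # The whole document is shorter than min_chars: keep it as one chunk.
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "\n\n".join(["word " * 80] * 6)
    for chunk in chunk_paragraphs(sample, min_chars=1024):
        print(len(chunk))
```

Because chunks are closed only at paragraph boundaries, chunk lengths vary and usually exceed the minimum, which is consistent with the "at least 1024 chars" wording.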
This dataset is well suited to generating fine-grained embeddings of the documents.
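
As a rough sketch, the two configurations declared in the metadata could be loaded with the Hugging Face `datasets` library. The helper names `config_for` and `stream_records` are hypothetical. The `train` split name is an assumption based on the library's default for a config with a single data file, and streaming is used here only to avoid downloading the full JSONL file. Record field names are not documented above, so the example just prints whatever keys the first record has.

```python
# Config names and data files as declared in the dataset card metadata.
CONFIG_FILES = {
    "chunk_size_1024": "gutenberg_en_paragraph_1024.jsonl",
    "chunk_size_2048": "gutenberg_en_paragraph_2048.jsonl",
}


def config_for(min_chars):
    """Map a minimum chunk length to the matching config name (hypothetical helper)."""
    name = f"chunk_size_{min_chars}"
    if name not in CONFIG_FILES:
        raise ValueError(f"no config for chunk size {min_chars}; choose 1024 or 2048")
    return name


def stream_records(min_chars=1024, limit=3):
    """Return the first `limit` records of a config without a full download."""
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(
        "eminorhan/gutenberg_en",
        config_for(min_chars),
        split="train",   # assumed default split name
        streaming=True,  # iterate lazily instead of downloading everything
    )
    return list(dataset.take(limit))


if __name__ == "__main__":
    records = stream_records(min_chars=1024, limit=1)
    # Field names are not documented in the card, so inspect them first.
    print(sorted(records[0].keys()))
```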
Provided by:
eminorhan



