eminorhan/gutenberg_en

Name: eminorhan/gutenberg_en
Creator: eminorhan
Published: 2023-11-17 20:55:28
License: MIT (as declared in the dataset card metadata)

Source: Hugging Face. Updated: 2023-11-17. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/eminorhan/gutenberg_en

Resource overview:

---
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 10M<n<100M
configs:
- config_name: chunk_size_1024
  data_files: "gutenberg_en_paragraph_1024.jsonl"
- config_name: chunk_size_2048
  data_files: "gutenberg_en_paragraph_2048.jsonl"
---

**Description of the dataset**

This is the November 16, 2023 snapshot of the English subset of the Project Gutenberg corpus (56712 documents in total), downloaded and preprocessed with code from [this repository](https://github.com/eminorhan/gutenberg).

Two versions of the data are provided:

* The `chunk_size_1024` version divides the data into ~14.2M records. Each record consists of a chunk of text a few paragraphs long, at least 1024 characters in length, together with the corresponding metadata.
* The `chunk_size_2048` version divides the data into ~8.2M records. Each record consists of a chunk of text a few paragraphs long, at least 2048 characters in length, together with the corresponding metadata.

The paragraph-level chunking makes this dataset well suited to generating fine-grained embeddings of the documents.
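The two configurations named in the card metadata can be selected by name when loading. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the data is exposed under the default `train` split (the card does not name a split, so this is an assumption). The helper names `config_for` and `load_chunks` are hypothetical and exist only for illustration; the config names and file names are copied from the card.

```python
REPO_ID = "eminorhan/gutenberg_en"

# Config names and data files exactly as listed in the dataset card metadata.
CONFIGS = {
    1024: ("chunk_size_1024", "gutenberg_en_paragraph_1024.jsonl"),
    2048: ("chunk_size_2048", "gutenberg_en_paragraph_2048.jsonl"),
}


def config_for(min_chars: int) -> tuple[str, str]:
    """Return (config_name, data_file) for a minimum chunk length (hypothetical helper)."""
    if min_chars not in CONFIGS:
        raise ValueError(f"min_chars must be one of {sorted(CONFIGS)}, got {min_chars}")
    return CONFIGS[min_chars]


def load_chunks(min_chars: int = 1024, streaming: bool = True):
    """Load one configuration; streaming avoids downloading millions of records up front."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    config_name, _ = config_for(min_chars)
    # Assumption: a single-file config is exposed as the default "train" split.
    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)
```

Since the card does not list the record fields, a reasonable first step is to inspect one record, for example `next(iter(load_chunks(1024))).keys()`, before relying on any particular field name.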

Provider:

eminorhan

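Summary of original information

Dataset description

The dataset is the November 16, 2023 snapshot of the English subset of the Project Gutenberg corpus (56712 documents in total), downloaded and preprocessed with code from https://github.com/eminorhan/gutenberg.

Two versions of the data are provided:

chunk_size_1024: about 14.2M records, each holding a chunk of text a few paragraphs long (at least 1024 characters) plus the corresponding metadata.
chunk_size_2048: about 8.2M records, each holding a chunk of text a few paragraphs long (at least 2048 characters) plus the corresponding metadata.

The dataset is well suited to generating fine-grained embeddings of the documents.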
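The chunking described above (chunks a few paragraphs long, each at least a given number of characters) can be approximated as follows. This is only a sketch under stated assumptions, not the author's preprocessing code, which lives in the linked repository. It assumes paragraphs are separated by blank lines, that they are merged greedily, and that a short trailing remainder is folded into the last chunk. The function name `chunk_paragraphs` is hypothetical.

```python
def chunk_paragraphs(text: str, min_chars: int = 1024) -> list[str]:
    """Greedily merge consecutive paragraphs until each chunk has at least min_chars characters."""
    # Assumption: paragraphs are delimited by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        current.append(para)
        current_len += len(para)
        if current_len >= min_chars:
            # Enough text accumulated: close the chunk and start a new one.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0

    if current:
        remainder = "\n\n".join(current)
        if chunks:
            # Assumption: a too-short tail is appended to the previous chunk.
            chunks[-1] = chunks[-1] + "\n\n" + remainder
        else:
            # The whole document is shorter than min_chars: keep it as one chunk.
            chunks.append(remainder)

    return chunks
```

With this scheme every chunk meets the minimum length unless the whole document is shorter than `min_chars`, in which case it is returned as a single chunk.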
