sanps/GutenbergFictionSummaryPrepared
Source: Hugging Face. Updated: 2024-01-02. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/sanps/GutenbergFictionSummaryPrepared
Resource description:
---
dataset_info:
  features:
  - name: file_id
    dtype: string
  - name: messages
    list:
    - name: book_text
      dtype: string
    - name: summary_text
      dtype: string
  splits:
  - name: train_full
    num_bytes: 1638018755
    num_examples: 150000
  - name: sample_train
    num_bytes: 54829870
    num_examples: 5000
  - name: val
    num_bytes: 198773079
    num_examples: 18238
  - name: train1
    num_bytes: 546267109
    num_examples: 50000
  - name: train2
    num_bytes: 546204032
    num_examples: 50000
  - name: train3
    num_bytes: 545547614
    num_examples: 50000
  - name: small_val
    num_bytes: 54739280
    num_examples: 5000
  download_size: 2268933771
  dataset_size: 3584379739
configs:
- config_name: default
  data_files:
  - split: train_full
    path: data/train_full-*
  - split: sample_train
    path: data/sample_train-*
  - split: val
    path: data/val-*
  - split: train1
    path: data/train1-*
  - split: train2
    path: data/train2-*
  - split: train3
    path: data/train3-*
  - split: small_val
    path: data/small_val-*
license: mit
language:
- en
pretty_name: Gutenberg Fiction Summaries and Text
---
Description:
Created for training models on fiction generation. The dataset contains pairs of LLM-generated summaries and the corresponding narrative texts from popular English fiction on Project Gutenberg.
Original dataset: sanps/GutenbergFictionSummary
Summaries are produced by cognitivecomputations/dolphin-2.6-mistral-7b.
The texts are from English books on Project Gutenberg that are tagged as fiction and have a minimum of 25 downloads, to ensure quality and interest. The dataset is organized into different splits. Each entry in a split consists of 1-4 contiguous book sections and their summaries.
Splits:
- train_full: 150k rows
- sample_train: 5k rows
- val: 18.2k rows
- train1, train2, train3: 50k rows each
- small_val: 5k rows
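The split sizes above can be cross-checked against the card's own metadata. The sketch below copies the `num_examples` and `num_bytes` values from the `dataset_info` block and adds them up. The helper name `shard_totals` is hypothetical. The result suggests, but does not prove, that train1, train2, and train3 together cover the same rows as train_full; the card itself does not state this.

```python
# (num_examples, num_bytes) per split, copied from the dataset_info metadata.
SPLITS = {
    "train_full": (150000, 1638018755),
    "sample_train": (5000, 54829870),
    "val": (18238, 198773079),
    "train1": (50000, 546267109),
    "train2": (50000, 546204032),
    "train3": (50000, 545547614),
    "small_val": (5000, 54739280),
}
DATASET_SIZE = 3584379739  # dataset_size from the same metadata block


def shard_totals(names):
    """Sum example counts and byte sizes over the named splits."""
    rows = sum(SPLITS[name][0] for name in names)
    size = sum(SPLITS[name][1] for name in names)
    return rows, size


# The three shards add up to exactly the row count and byte size of train_full.
shards_match = shard_totals(["train1", "train2", "train3"]) == SPLITS["train_full"]

# dataset_size is the sum over all seven splits, so train_full and its
# apparent shards are both counted in that figure.
total_bytes = sum(size for _, size in SPLITS.values())

print(shards_match)
print(total_bytes == DATASET_SIZE)
```

Both checks print `True`, so anyone training on train_full should probably not also add train1-3, since that would likely duplicate the data.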
Data Format:
Each row has a `file_id` string and a `messages` field, which is a JSON array of objects:
[ {"summary_text": "Generated summary", "book_text": "Extended text"}, ... (up to 4 pairs per entry)]
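As a rough illustration of this format, the sketch below turns one row into (summary, text) training pairs. The row shape follows the `file_id` and `messages` features listed in the metadata; the example row and the helper name `to_pairs` are made up for illustration and are not taken from the dataset.

```python
from typing import Dict, List, Tuple


def to_pairs(row: Dict) -> List[Tuple[str, str]]:
    """Return (summary_text, book_text) pairs in their original order."""
    return [(m["summary_text"], m["book_text"]) for m in row["messages"]]


# Hypothetical row with two contiguous sections; real rows hold 1-4 pairs.
example_row = {
    "file_id": "0000",  # placeholder, not a real Project Gutenberg id
    "messages": [
        {"summary_text": "A stranger arrives.", "book_text": "It was late when..."},
        {"summary_text": "He is recognized.", "book_text": "By morning the town..."},
    ],
}

pairs = to_pairs(example_row)
for summary, text in pairs:
    print(f"{summary!r} -> {len(text)} characters of book text")
```

With the Hugging Face `datasets` library installed, real rows could be fetched with `load_dataset("sanps/GutenbergFictionSummaryPrepared", split="sample_train")` and passed to the same helper, assuming the hosted data matches the schema above.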
File ID:
The id of the book in Project Gutenberg.
Licensing:
See Project Gutenberg's policy: https://www.gutenberg.org/policy/permission.html
This dataset was created for training models on fiction generation. It contains pairs of LLM-generated summaries and the corresponding narrative texts from popular English fiction on Project Gutenberg. The texts come from English books that are tagged as fiction and have a minimum of 25 downloads, to ensure quality and interest. The summaries are produced by cognitivecomputations/dolphin-2.6-mistral-7b. The dataset is organized into different splits, and each entry consists of 1-4 contiguous book sections and their summaries. The data format is a JSON array of objects, each containing a generated summary and the corresponding book text.
Provider:
sanps
Original Information Summary
Dataset Overview
Dataset Information
- Features:
  - file_id: string
  - messages: list with the following fields:
    - book_text: string
    - summary_text: string
- Splits:
  - train_full: 150000 examples, 1638018755 bytes
  - sample_train: 5000 examples, 54829870 bytes
  - val: 18238 examples, 198773079 bytes
  - train1: 50000 examples, 546267109 bytes
  - train2: 50000 examples, 546204032 bytes
  - train3: 50000 examples, 545547614 bytes
  - small_val: 5000 examples, 54739280 bytes
- Download size: 2268933771 bytes
- Dataset size: 3584379739 bytes
Configuration
- Config name: default
- Data files:
  - train_full: data/train_full-*
  - sample_train: data/sample_train-*
  - val: data/val-*
  - train1: data/train1-*
  - train2: data/train2-*
  - train3: data/train3-*
  - small_val: data/small_val-*
License
- License: MIT
Language
- Language: English
Dataset Name
- Name: Gutenberg Fiction Summaries and Text



