BEE-spoke-data/rp_books-en
Source: Hugging Face. Updated 2024-05-12; indexed 2024-06-15.
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/rp_books-en
Description:
This dataset is a filtered and cleaned version of the RedPajama books subset of `togethercomputer/Long-Data-Collections`. It ships several configurations, including `clean`, `default`, and `embeddings-jina-base`, each with its own features and size. It targets text generation, feature extraction, and fill-mask tasks. The language is English, and the dataset is tagged as books and long documents.
Provided by:
BEE-spoke-data
Summary of original information
Dataset overview
Dataset configurations
Configuration: clean
- Features:
  - meta:
    - publication_date (int64)
    - short_book_title (string)
    - url (string)
  - text (string)
  - first_25k (string)
  - score (float64)
- Splits:
  - train:
    - Bytes: 10956591018.806879
    - Examples: 25575
- Download size (bytes): 6784885445
- Dataset size (bytes): 10956591018.806879
Configuration: default
- Features:
  - meta:
    - publication_date (int64)
    - short_book_title (string)
    - url (string)
  - text (string)
- Splits:
  - train:
    - Bytes: 10580548205.687407
    - Examples: 26372
- Download size (bytes): 6635583644
- Dataset size (bytes): 10580548205.687407
Configuration: embeddings-jina-base
- Features:
  - meta:
    - publication_date (int64)
    - short_book_title (string)
    - url (string)
  - text (string)
  - embedding (sequence: float64)
- Splits:
  - train:
    - Bytes: 10801330292
    - Examples: 26372
- Download size (bytes): 6772846092
- Dataset size (bytes): 10801330292
Configuration: filtered-clean_grade
- Features:
  - meta:
    - publication_date (int64)
    - short_book_title (string)
    - url (string)
  - text (string)
  - score (float64)
- Splits:
  - train:
    - Bytes: 1132451934.8929183
    - Examples: 2918
- Download size (bytes): 694597113
- Dataset size (bytes): 1132451934.8929183
Configuration: filtered-mild_grade
- Features:
  - meta:
    - publication_date (int64)
    - short_book_title (string)
    - url (string)
  - text (string)
  - score (float64)
- Splits:
  - train:
    - Bytes: 4869464328.592873
    - Examples: 12018
- Download size (bytes): 3021366037
- Dataset size (bytes): 4869464328.592873
Configuration: graded
- Features:
  - meta:
    - publication_date (int64)
    - short_book_title (string)
    - url (string)
  - text (string)
  - label (string)
  - score (float64)
- Splits:
  - train:
    - Bytes: 10639835144
    - Examples: 26372
- Download size (bytes): 6599881939
- Dataset size (bytes): 10639835144
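The per-configuration figures above are easier to compare once reduced to a few ratios. The following is a minimal sketch that copies the listed example counts and byte sizes into a dictionary and derives the mean size per example, the download-to-dataset size ratio, and each configuration's example count relative to `default`. The names `CONFIGS` and `summarize` are illustrative, and treating `default` as the baseline for comparison is an assumption, not something the listing states.

```python
# Figures copied from the configuration listings above; all sizes are in bytes.
CONFIGS = {
    "clean": {"examples": 25575, "dataset_size": 10956591018.806879, "download_size": 6784885445},
    "default": {"examples": 26372, "dataset_size": 10580548205.687407, "download_size": 6635583644},
    "embeddings-jina-base": {"examples": 26372, "dataset_size": 10801330292, "download_size": 6772846092},
    "filtered-clean_grade": {"examples": 2918, "dataset_size": 1132451934.8929183, "download_size": 694597113},
    "filtered-mild_grade": {"examples": 12018, "dataset_size": 4869464328.592873, "download_size": 3021366037},
    "graded": {"examples": 26372, "dataset_size": 10639835144, "download_size": 6599881939},
}


def summarize(name):
    """Derive simple ratios for one configuration (hypothetical helper)."""
    cfg = CONFIGS[name]
    return {
        # Average size of one example (one book) in bytes.
        "mean_bytes_per_example": cfg["dataset_size"] / cfg["examples"],
        # Size of the download relative to the dataset size listed for the config.
        "download_ratio": cfg["download_size"] / cfg["dataset_size"],
        # Example count relative to `default` (assumed baseline for comparison).
        "share_of_default": cfg["examples"] / CONFIGS["default"]["examples"],
    }


if __name__ == "__main__":
    for name in CONFIGS:
        s = summarize(name)
        print(
            f"{name:22s} "
            f"{s['mean_bytes_per_example'] / 1e3:8.1f} kB/example  "
            f"download ratio {s['download_ratio']:.2f}  "
            f"{s['share_of_default']:.1%} of default"
        )
```

On these numbers, a typical example is a few hundred kilobytes of text, which is consistent with the long-document tag.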
Data file paths
- clean:
  - train: clean/train-*
- default:
  - train: data/train-*
- embeddings-jina-base:
  - train: embeddings-jina-base/train-*
- filtered-clean_grade:
  - train: filtered-clean_grade/train-*
- filtered-mild_grade:
  - train: filtered-mild_grade/train-*
- graded:
  - train: graded/train-*
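The path patterns above map each configuration to its train shards. As a rough sketch, the snippet below records those patterns, resolves a repository-relative file path back to its configuration, and shows one way a configuration might be loaded with the Hugging Face `datasets` library. The helper names (`config_for_path`, `load_train`) and the shard filenames in the usage note are hypothetical. Streaming is chosen here only because each full configuration is several gigabytes.

```python
from fnmatch import fnmatch

DATASET_ID = "BEE-spoke-data/rp_books-en"

# Train-split glob patterns, copied from the data file paths listed above.
DATA_FILES = {
    "clean": "clean/train-*",
    "default": "data/train-*",
    "embeddings-jina-base": "embeddings-jina-base/train-*",
    "filtered-clean_grade": "filtered-clean_grade/train-*",
    "filtered-mild_grade": "filtered-mild_grade/train-*",
    "graded": "graded/train-*",
}


def config_for_path(path):
    """Return the configuration whose train pattern matches `path`, or None."""
    for name, pattern in DATA_FILES.items():
        if fnmatch(path, pattern):
            return name
    return None


def load_train(config="default", streaming=True):
    """Load the train split of one configuration (needs network access)."""
    if config not in DATA_FILES:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, config, split="train", streaming=streaming)
```

For example, `config_for_path("data/train-00000-of-00022.parquet")` resolves to `default` (the shard filename is made up for illustration). With `datasets` installed, `next(iter(load_train("graded")))` would yield a single record without downloading the whole configuration.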
Task categories
- Text generation
- Feature extraction
- Fill-mask
Language
- English
Tags
- Books
- Long documents



