nickypro/minipile-split
Source: Hugging Face. Updated 2024-07-18, indexed 2024-07-22.
Download link:
https://hf-mirror.com/datasets/nickypro/minipile-split
Description:
This dataset is a smaller version of the armelrs dataset, which is itself a split version of The Pile, a general-purpose text dataset. This version provides both a train and a test set, is easy to download at about 2.3 GB, and allows individual text splits to be selected.
Provider:
nickypro
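Summary of original information
Dataset overview
This dataset is a split version of "The Pile". It contains multiple sub-datasets, each with a train set and a test set. Details for each sub-dataset follow:
Sub-dataset list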
ArXiv
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 105393748 bytes, 2212 examples
  test: 33002660 bytes, 745 examples
- Download size: 128240714 bytes
- Dataset size: 138396408 bytes
BookCorpus2
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 92009676 bytes, 257 examples
  test: 2769179 bytes, 8 examples
- Download size: 112489690 bytes
- Dataset size: 94778855 bytes
Books3
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 108880516 bytes, 197 examples
  test: 41526969 bytes, 87 examples
- Download size: 177202552 bytes
- Dataset size: 150407485 bytes
DM Mathematics
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 104858342 bytes, 12735 examples
  test: 4941242 bytes, 601 examples
- Download size: 55363453 bytes
- Dataset size: 109799584 bytes
Enron Emails
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 16701351 bytes, 9378 examples
  test: 541500 bytes, 291 examples
- Download size: 19423900 bytes
- Dataset size: 17242851 bytes
EuroParl
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 94200121 bytes, 1334 examples
  test: 3629223 bytes, 42 examples
- Download size: 106611978 bytes
- Dataset size: 97829344 bytes
FreeLaw
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 106892914 bytes, 6519 examples
  test: 25162182 bytes, 1588 examples
- Download size: 136641886 bytes
- Dataset size: 132055096 bytes
Github
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 103649709 bytes, 19216 examples
  test: 31333794 bytes, 5651 examples
- Download size: 100372522 bytes
- Dataset size: 134983503 bytes
Gutenberg (PG-19)
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 102759663 bytes, 256 examples
  test: 8678842 bytes, 21 examples
- Download size: 137454348 bytes
- Dataset size: 111438505 bytes
HackerNews
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 79788458 bytes, 15917 examples
  test: 2540857 bytes, 493 examples
- Download size: 103554428 bytes
- Dataset size: 82329315 bytes
NIH ExPorter
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 39126793 bytes, 18002 examples
  test: 1222934 bytes, 557 examples
- Download size: 43770700 bytes
- Dataset size: 40349727 bytes
OpenSubtitles
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 104735849 bytes, 3389 examples
  test: 6108121 bytes, 199 examples
- Download size: 130099164 bytes
- Dataset size: 110843970 bytes
OpenWebText2
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 106039804 bytes, 26421 examples
  test: 39817479 bytes, 10125 examples
- Download size: 178103310 bytes
- Dataset size: 145857283 bytes
PhilPapers
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 47970203 bytes, 647 examples
  test: 1538027 bytes, 21 examples
- Download size: 52345194 bytes
- Dataset size: 49508230 bytes
Pile-CC
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 103171504 bytes, 23300 examples
  test: 72105538 bytes, 16421 examples
- Download size: 208522136 bytes
- Dataset size: 175277042 bytes
PubMed Abstracts
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 104936477 bytes, 76172 examples
  test: 12600783 bytes, 9185 examples
- Download size: 129554148 bytes
- Dataset size: 117537260 bytes
PubMed Central
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 102994643 bytes, 3273 examples
  test: 56626731 bytes, 1779 examples
- Download size: 148298044 bytes
- Dataset size: 159621374 bytes
StackExchange
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 103761034 bytes, 46841 examples
  test: 20466176 bytes, 9247 examples
- Download size: 129854838 bytes
- Dataset size: 124227210 bytes
UPSTO Backgrounds
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 104750819 bytes, 24632 examples
  test: 14799000 bytes, 3484 examples
- Download size: 113457922 bytes
- Dataset size: 119549819 bytes
USPTO Backgrounds
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 104750819 bytes, 24632 examples
  test: 14799000 bytes, 3484 examples
- Download size: 113457922 bytes
- Dataset size: 119549819 bytes
Ubuntu IRC
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 105150097 bytes, 203 examples
  test: 324716 bytes, 7 examples
- Download size: 111839542 bytes
- Dataset size: 105474813 bytes
Wikipedia (en)
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 105309044 bytes, 34309 examples
  test: 16269639 bytes, 5305 examples
- Download size: 140418662 bytes
- Dataset size: 121578683 bytes
YoutubeSubtitles
- Features:
  text: string
  meta: struct (pile_set_name: string)
  domain: string
- Splits:
  train: 79086408 bytes, 3322 examples
  test: 1373528 bytes, 103 examples
- Download size: 99103806 bytes
- Dataset size: 80459936 bytes
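
Since every sub-dataset listed here has its own train and test split, one plausible way to fetch a single split is through the Hugging Face `datasets` library. The sketch below assumes that each sub-dataset name (for example `ArXiv`) is exposed as a configuration name of `nickypro/minipile-split`, which this listing suggests but does not confirm. The helper names `load_args` and `load_subset` are hypothetical and introduced only for illustration.

```python
REPO_ID = "nickypro/minipile-split"
SPLITS = ("train", "test")  # the two splits listed for every sub-dataset


def load_args(subset: str, split: str = "train") -> dict:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    # Assumption: the sub-dataset name doubles as the config name.
    return {"path": REPO_ID, "name": subset, "split": split}


def load_subset(subset: str, split: str = "train"):
    """Download one split of one sub-dataset; needs network access."""
    # Imported lazily so the module still loads without `datasets` installed.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(**load_args(subset, split))
```

Calling `load_subset("ArXiv", "test")` would then attempt to download only that split instead of the full ~2.3 GB collection, provided the configuration names match the headings in this listing.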



