iNeil77/pseudo-mini-pile
Source: Hugging Face. Updated 2024-03-09; indexed 2024-06-22.
Download link:
https://hf-mirror.com/datasets/iNeil77/pseudo-mini-pile
Description:
---
dataset_info:
- config_name: all
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 360187653412.6177
    num_examples: 56194997
  download_size: 199030076349
  dataset_size: 360187653412.6177
- config_name: c4_realnews
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 31597106256.723488
    num_examples: 11427438
  download_size: 19889880484
  dataset_size: 31597106256.723488
- config_name: openwebtext
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 30974178275.039234
    num_examples: 6474479
  download_size: 19069709415
  dataset_size: 30974178275.039234
- config_name: peS2o
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 221900508006.5479
    num_examples: 32612199
  download_size: 116217303065
  dataset_size: 221900508006.5479
- config_name: redpajama_books
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 49246538575.26426
    num_examples: 107443
  download_size: 29612204926
  dataset_size: 49246538575.26426
- config_name: stackexchange
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 2034535930.2150385
    num_examples: 716532
  download_size: 1222605537
  dataset_size: 2034535930.2150385
- config_name: uspto
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 14755999149.910166
    num_examples: 3247716
  download_size: 7058272149
  dataset_size: 14755999149.910166
- config_name: wiki
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 7528525537.163156
    num_examples: 1609190
  download_size: 4593971902
  dataset_size: 7528525537.163156
configs:
- config_name: all
  data_files:
  - split: train
    path: all/train-*
- config_name: c4_realnews
  data_files:
  - split: train
    path: c4_realnews/train-*
- config_name: openwebtext
  data_files:
  - split: train
    path: openwebtext/train-*
- config_name: peS2o
  data_files:
  - split: train
    path: peS2o/train-*
- config_name: redpajama_books
  data_files:
  - split: train
    path: redpajama_books/train-*
- config_name: stackexchange
  data_files:
  - split: train
    path: stackexchange/train-*
- config_name: uspto
  data_files:
  - split: train
    path: uspto/train-*
- config_name: wiki
  data_files:
  - split: train
    path: wiki/train-*
task_categories:
- text-generation
language:
- en
size_categories:
- 10M<n<100M
---
A small, aggressively cleaned and de-duped pre-training corpus for academic settings. It aims to recreate something akin to [The Pile](https://huggingface.co/datasets/EleutherAI/pile) but prioritizes quality for the constrained token budget academic researchers live with.
It has seven config subsets and an eighth `all` subset that combines them, for a total of ~91B tokens (GPT-2 tokenizer estimate). The subsets are as follows:
1. `c4_realnews`: The RealNews domain subset of the C4 dataset containing news articles.
2. `openwebtext`: The OpenWebText dataset containing the contents of the links mentioned in Reddit posts with at least 3 upvotes.
3. `peS2o`: The peS2o dataset containing academic articles from Semantic Scholar.
4. `redpajama_books`: The books subset of RedPajama V1.
5. `stackexchange`: The EN StackExchange non-code subset of the BigScience ROOTS dataset.
6. `uspto`: The EN USPTO patent application contents subset of the BigScience ROOTS dataset.
7. `wiki`: The EN Wiki subset of the BigScience ROOTS dataset.
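
As a quick sanity check on the metadata above, the per-subset example counts add up exactly to the `all` count. The sketch below records those counts and wraps the Hugging Face `datasets.load_dataset` call. The helper name `load_subset` and the default of streaming mode are illustrative assumptions, not part of the dataset card.

```python
# Example counts copied from the dataset_info metadata of this card.
NUM_EXAMPLES = {
    "c4_realnews": 11_427_438,
    "openwebtext": 6_474_479,
    "peS2o": 32_612_199,
    "redpajama_books": 107_443,
    "stackexchange": 716_532,
    "uspto": 3_247_716,
    "wiki": 1_609_190,
}
ALL_EXAMPLES = 56_194_997  # num_examples of the `all` config

REPO_ID = "iNeil77/pseudo-mini-pile"

# The seven subsets together account for every example in `all`.
assert sum(NUM_EXAMPLES.values()) == ALL_EXAMPLES


def load_subset(config_name: str, streaming: bool = True):
    """Hypothetical helper: load the train split of one config.

    Streaming is assumed so that nothing is downloaded up front;
    pass streaming=False to materialize the subset locally.
    """
    if config_name != "all" and config_name not in NUM_EXAMPLES:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so the constants above are usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)
```

Calling `load_subset("wiki")` should return an iterable dataset whose records each carry a single `content` string field, per the features listed in the metadata.
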
The following processing and filtering steps have been applied:
1. Removed citation text and bibliography information for academic texts.
2. Ran a perplexity filter using a KenLM model trained on the EN OSCAR corpus and removed documents with a perplexity above 325 or below 7.
3. Removed samples whose proportion of repeating n-grams (n <= 4) reaches the 15% threshold.
4. Removed samples that the lingua language detector identifies as EN with less than 99% confidence.
5. Performed an aggressive MinHash de-dupe using a shingle size of 8 and a low threshold of 0.5.
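
The thresholds in steps 2, 3, and 5 can be sketched in code. This is a minimal illustration under stated assumptions, not the author's pipeline: the perplexity score is taken as an input because the KenLM model itself is not specified here, the repetition measure is computed for a single n (how the source aggregates across n <= 4 is not stated), shingles are assumed to be word-level, and every function name is hypothetical. The language filter of step 4 is omitted.

```python
import hashlib
from collections import Counter

PPL_MIN, PPL_MAX = 7.0, 325.0  # step 2 thresholds
REPEAT_MAX = 0.15              # step 3 threshold
SHINGLE_SIZE = 8               # step 5 shingle size
DEDUPE_THRESHOLD = 0.5         # step 5 similarity threshold


def keep_by_perplexity(ppl: float) -> bool:
    """Keep a document only if its perplexity lies within [7, 325]."""
    return PPL_MIN <= ppl <= PPL_MAX


def repeated_ngram_fraction(tokens: list[str], n: int = 4) -> float:
    """Assumed measure: share of n-gram occurrences whose n-gram
    appears more than once in the document."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total if total else 0.0


def shingles(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Word-level k-shingles (assumption: the card does not say whether
    shingles are built over words or characters)."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per salted hash function; salts stand in for random
    permutations. `items` must be non-empty."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a: str, text_b: str) -> bool:
    """Flag a pair whose estimated similarity reaches the 0.5 threshold."""
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    return estimated_jaccard(sig_a, sig_b) >= DEDUPE_THRESHOLD
```

At corpus scale, signatures are typically bucketed with locality-sensitive hashing rather than compared pairwise; that part is left out of this sketch.
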
Provider:
iNeil77
Summary of original information
Dataset overview
Dataset configurations
1. all
   - Features:
     - content: string
   - Splits:
     - train: 56,194,997 examples, 360,187,653,412.6177 bytes
   - Download size: 199,030,076,349 bytes
   - Dataset size: 360,187,653,412.6177 bytes
2. c4_realnews
   - Features:
     - content: string
   - Splits:
     - train: 11,427,438 examples, 31,597,106,256.723488 bytes
   - Download size: 19,889,880,484 bytes
   - Dataset size: 31,597,106,256.723488 bytes
3. openwebtext
   - Features:
     - content: string
   - Splits:
     - train: 6,474,479 examples, 30,974,178,275.039234 bytes
   - Download size: 19,069,709,415 bytes
   - Dataset size: 30,974,178,275.039234 bytes
4. peS2o
   - Features:
     - content: string
   - Splits:
     - train: 32,612,199 examples, 221,900,508,006.5479 bytes
   - Download size: 116,217,303,065 bytes
   - Dataset size: 221,900,508,006.5479 bytes
5. redpajama_books
   - Features:
     - content: string
   - Splits:
     - train: 107,443 examples, 49,246,538,575.26426 bytes
   - Download size: 29,612,204,926 bytes
   - Dataset size: 49,246,538,575.26426 bytes
6. stackexchange
   - Features:
     - content: string
   - Splits:
     - train: 716,532 examples, 2,034,535,930.2150385 bytes
   - Download size: 1,222,605,537 bytes
   - Dataset size: 2,034,535,930.2150385 bytes
7. uspto
   - Features:
     - content: string
   - Splits:
     - train: 3,247,716 examples, 14,755,999,149.910166 bytes
   - Download size: 7,058,272,149 bytes
   - Dataset size: 14,755,999,149.910166 bytes
8. wiki
   - Features:
     - content: string
   - Splits:
     - train: 1,609,190 examples, 7,528,525,537.163156 bytes
   - Download size: 4,593,971,902 bytes
   - Dataset size: 7,528,525,537.163156 bytes
Data file paths
- all: `all/train-*`
- c4_realnews: `c4_realnews/train-*`
- openwebtext: `openwebtext/train-*`
- peS2o: `peS2o/train-*`
- redpajama_books: `redpajama_books/train-*`
- stackexchange: `stackexchange/train-*`
- uspto: `uspto/train-*`
- wiki: `wiki/train-*`
Task categories
- Text generation
Language
- English
Size category
- 10M<n<100M



