iNeil77/pseudo-mini-pile

Name: iNeil77/pseudo-mini-pile
Creator: iNeil77
Published: 2024-03-09 17:25:32
License: No description available

Source: Hugging Face. Updated 2024-03-09. Indexed 2024-06-22.

Download link:

https://hf-mirror.com/datasets/iNeil77/pseudo-mini-pile


Description:

---
dataset_info:
- config_name: all
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 360187653412.6177
    num_examples: 56194997
  download_size: 199030076349
  dataset_size: 360187653412.6177
- config_name: c4_realnews
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 31597106256.723488
    num_examples: 11427438
  download_size: 19889880484
  dataset_size: 31597106256.723488
- config_name: openwebtext
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 30974178275.039234
    num_examples: 6474479
  download_size: 19069709415
  dataset_size: 30974178275.039234
- config_name: peS2o
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 221900508006.5479
    num_examples: 32612199
  download_size: 116217303065
  dataset_size: 221900508006.5479
- config_name: redpajama_books
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 49246538575.26426
    num_examples: 107443
  download_size: 29612204926
  dataset_size: 49246538575.26426
- config_name: stackexchange
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 2034535930.2150385
    num_examples: 716532
  download_size: 1222605537
  dataset_size: 2034535930.2150385
- config_name: uspto
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 14755999149.910166
    num_examples: 3247716
  download_size: 7058272149
  dataset_size: 14755999149.910166
- config_name: wiki
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 7528525537.163156
    num_examples: 1609190
  download_size: 4593971902
  dataset_size: 7528525537.163156
configs:
- config_name: all
  data_files:
  - split: train
    path: all/train-*
- config_name: c4_realnews
  data_files:
  - split: train
    path: c4_realnews/train-*
- config_name: openwebtext
  data_files:
  - split: train
    path: openwebtext/train-*
- config_name: peS2o
  data_files:
  - split: train
    path: peS2o/train-*
- config_name: redpajama_books
  data_files:
  - split: train
    path: redpajama_books/train-*
- config_name: stackexchange
  data_files:
  - split: train
    path: stackexchange/train-*
- config_name: uspto
  data_files:
  - split: train
    path: uspto/train-*
- config_name: wiki
  data_files:
  - split: train
    path: wiki/train-*
task_categories:
- text-generation
language:
- en
size_categories:
- 10M<n<100M
---

A small, aggressively cleaned and de-duped pre-training corpus for academic settings. It aims to recreate something akin to [The Pile](https://huggingface.co/datasets/EleutherAI/pile) but prioritizes quality for the constrained token budget academic researchers live with.

It has seven config subsets and an eighth `all` subset that combines them, for a total of ~91B tokens (GPT-2 tokenizer estimate). The subsets are as follows:

1. `c4_realnews`: The RealNews domain subset of the C4 dataset, containing news articles.
2. `openwebtext`: The OpenWebText dataset, containing the contents of links mentioned in Reddit posts with at least 3 upvotes.
3. `peS2o`: The peS2o dataset, containing academic articles from Semantic Scholar.
4. `redpajama_books`: The books subset of RedPajama V1.
5. `stackexchange`: The English StackExchange non-code subset of the BigScience ROOTS dataset.
6. `uspto`: The English USPTO patent application contents subset of the BigScience ROOTS dataset.
7. `wiki`: The English Wiki subset of the BigScience ROOTS dataset.

The following processing and filtering steps have been applied:

1. Removed citation text and bibliography information from academic texts.
2. Ran a perplexity filter using a KenLM model trained on the English OSCAR corpus and removed documents with a perplexity of more than 325 or less than 7.
3. Removed samples with a repeating <=4-gram proportion of 15% or more.
4. Removed samples with lower than 99% confidence of being English according to the lingua language detector.
5. Performed an aggressive MinHash de-dupe using a shingle size of 8 and a low threshold of 0.5.
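The threshold-based steps above (perplexity bounds, repeated n-gram proportion, MinHash de-dupe) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's pipeline: the card does not define how the repeated n-gram proportion is computed, how many MinHash permutations were used, or how text was tokenized, so the word-level splitting, the `n=4` default, `num_perm=128`, and every function name here are hypothetical. Perplexity scoring and language detection are left out; `perplexity_ok` only applies the quoted bounds to a score computed elsewhere.

```python
import hashlib
from collections import Counter

# Thresholds quoted in the dataset card.
PPL_MIN, PPL_MAX = 7.0, 325.0
MAX_REPEAT_FRACTION = 0.15
SHINGLE_SIZE = 8
DEDUPE_THRESHOLD = 0.5


def perplexity_ok(ppl: float) -> bool:
    """Keep a document unless its perplexity is above 325 or below 7."""
    return PPL_MIN <= ppl <= PPL_MAX


def repeated_ngram_fraction(text: str, n: int = 4) -> float:
    """Fraction of word n-gram occurrences that repeat an earlier occurrence.

    Assumed definition; the card does not spell out its own.
    """
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def shingles(text: str, k: int = SHINGLE_SIZE) -> set:
    """Set of lowercase word k-shingles (one shingle if the text is shorter than k words)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def minhash_signature(text: str, num_perm: int = 128) -> list:
    """MinHash signature: for each salted hash function, the minimum hash over all shingles."""
    sh = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in sh
        ))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Share of matching signature slots, which estimates shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a: str, text_b: str) -> bool:
    """Flag a pair whose estimated similarity reaches the 0.5 threshold."""
    sim = estimated_jaccard(minhash_signature(text_a), minhash_signature(text_b))
    return sim >= DEDUPE_THRESHOLD
```

A production pipeline would normally pair the signatures with locality-sensitive hashing rather than comparing every pair of documents; the pairwise check here only shows what the 0.5 threshold is applied to.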

Provided by:

iNeil77

Summary of original information

Dataset overview

Dataset configurations

1. `all`

Features:
- content: string
Splits:
- train: 56,194,997 examples, 360,187,653,412.6177 bytes
Download size: 199,030,076,349 bytes
Dataset size: 360,187,653,412.6177 bytes

2. `c4_realnews`

Features:
- content: string
Splits:
- train: 11,427,438 examples, 31,597,106,256.723488 bytes
Download size: 19,889,880,484 bytes
Dataset size: 31,597,106,256.723488 bytes

3. `openwebtext`

Features:
- content: string
Splits:
- train: 6,474,479 examples, 30,974,178,275.039234 bytes
Download size: 19,069,709,415 bytes
Dataset size: 30,974,178,275.039234 bytes

4. `peS2o`

Features:
- content: string
Splits:
- train: 32,612,199 examples, 221,900,508,006.5479 bytes
Download size: 116,217,303,065 bytes
Dataset size: 221,900,508,006.5479 bytes

5. `redpajama_books`

Features:
- content: string
Splits:
- train: 107,443 examples, 49,246,538,575.26426 bytes
Download size: 29,612,204,926 bytes
Dataset size: 49,246,538,575.26426 bytes

6. `stackexchange`

Features:
- content: string
Splits:
- train: 716,532 examples, 2,034,535,930.2150385 bytes
Download size: 1,222,605,537 bytes
Dataset size: 2,034,535,930.2150385 bytes

7. `uspto`

Features:
- content: string
Splits:
- train: 3,247,716 examples, 14,755,999,149.910166 bytes
Download size: 7,058,272,149 bytes
Dataset size: 14,755,999,149.910166 bytes

8. `wiki`

Features:
- content: string
Splits:
- train: 1,609,190 examples, 7,528,525,537.163156 bytes
Download size: 4,593,971,902 bytes
Dataset size: 7,528,525,537.163156 bytes
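As a quick consistency check on the figures above, the example counts of the seven source subsets add up exactly to the count reported for `all`. The sketch below reproduces that arithmetic and derives each subset's share of examples. The counts are copied from the listing; the variable names are illustrative only.

```python
# Example counts per config, copied from the listing above.
NUM_EXAMPLES = {
    "c4_realnews": 11_427_438,
    "openwebtext": 6_474_479,
    "peS2o": 32_612_199,
    "redpajama_books": 107_443,
    "stackexchange": 716_532,
    "uspto": 3_247_716,
    "wiki": 1_609_190,
}
ALL_EXAMPLES = 56_194_997

# The seven subsets together account for every example in `all`.
subset_total = sum(NUM_EXAMPLES.values())

# Share of examples contributed by each subset, largest first.
shares = sorted(
    ((name, count / ALL_EXAMPLES) for name, count in NUM_EXAMPLES.items()),
    key=lambda item: item[1],
    reverse=True,
)

if __name__ == "__main__":
    print(f"sum of subsets: {subset_total:,} (all: {ALL_EXAMPLES:,})")
    for name, share in shares:
        print(f"{name:16s} {share:6.1%}")
```

By example count, `peS2o` makes up more than half of the corpus, while `redpajama_books` contributes very few but very long documents.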

Data file paths

all: all/train-*
c4_realnews: c4_realnews/train-*
openwebtext: openwebtext/train-*
peS2o: peS2o/train-*
redpajama_books: redpajama_books/train-*
stackexchange: stackexchange/train-*
uspto: uspto/train-*
wiki: wiki/train-*
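Each config name above maps to a `train` split, so a single subset can be pulled without downloading the whole corpus. The following is a minimal sketch assuming the Hugging Face `datasets` library is installed and the repository is reachable. `load_subset` and `data_glob` are hypothetical helper names, and streaming is used by default so nothing is written to disk up front.

```python
REPO_ID = "iNeil77/pseudo-mini-pile"

# Config names as listed in the dataset card.
CONFIGS = (
    "all",
    "c4_realnews",
    "openwebtext",
    "peS2o",
    "redpajama_books",
    "stackexchange",
    "uspto",
    "wiki",
)


def data_glob(config: str) -> str:
    """Return the data file pattern listed for a config, e.g. 'wiki/train-*'."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    return f"{config}/train-*"


def load_subset(config: str, streaming: bool = True):
    """Load the train split of one config (hypothetical convenience wrapper)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so that validating names does not require the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split="train", streaming=streaming)


if __name__ == "__main__":
    # Peek at the first record of the smallest-download subset.
    stream = load_subset("stackexchange")
    first = next(iter(stream))
    print(first["content"][:200])
```

Streaming suits the larger configs such as `peS2o` or `all`, whose listed download sizes run to over a hundred gigabytes.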

Task categories

Text generation

Language

English

Size categories

10M<n<100M
