
iNeil77/pseudo-mini-pile

Hugging Face | Updated 2024-03-09 | Indexed 2024-06-22
Download link:
https://hf-mirror.com/datasets/iNeil77/pseudo-mini-pile
Description:
---
dataset_info:
- config_name: all
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 360187653412.6177
    num_examples: 56194997
  download_size: 199030076349
  dataset_size: 360187653412.6177
- config_name: c4_realnews
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 31597106256.723488
    num_examples: 11427438
  download_size: 19889880484
  dataset_size: 31597106256.723488
- config_name: openwebtext
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 30974178275.039234
    num_examples: 6474479
  download_size: 19069709415
  dataset_size: 30974178275.039234
- config_name: peS2o
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 221900508006.5479
    num_examples: 32612199
  download_size: 116217303065
  dataset_size: 221900508006.5479
- config_name: redpajama_books
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 49246538575.26426
    num_examples: 107443
  download_size: 29612204926
  dataset_size: 49246538575.26426
- config_name: stackexchange
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 2034535930.2150385
    num_examples: 716532
  download_size: 1222605537
  dataset_size: 2034535930.2150385
- config_name: uspto
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 14755999149.910166
    num_examples: 3247716
  download_size: 7058272149
  dataset_size: 14755999149.910166
- config_name: wiki
  features:
  - name: content
    dtype: string
  splits:
  - name: train
    num_bytes: 7528525537.163156
    num_examples: 1609190
  download_size: 4593971902
  dataset_size: 7528525537.163156
configs:
- config_name: all
  data_files:
  - split: train
    path: all/train-*
- config_name: c4_realnews
  data_files:
  - split: train
    path: c4_realnews/train-*
- config_name: openwebtext
  data_files:
  - split: train
    path: openwebtext/train-*
- config_name: peS2o
  data_files:
  - split: train
    path: peS2o/train-*
- config_name: redpajama_books
  data_files:
  - split: train
    path: redpajama_books/train-*
- config_name: stackexchange
  data_files:
  - split: train
    path: stackexchange/train-*
- config_name: uspto
  data_files:
  - split: train
    path: uspto/train-*
- config_name: wiki
  data_files:
  - split: train
    path: wiki/train-*
task_categories:
- text-generation
language:
- en
size_categories:
- 10M<n<100M
---

A small, aggressively cleaned and de-duped pre-training corpus for academic settings. It aims to recreate something akin to [The Pile](https://huggingface.co/datasets/EleutherAI/pile) but prioritizes quality for the constrained token budget academic researchers live with.

It has seven config subsets and an eighth `all` subset that combines them, for a total of ~91B tokens (GPT-2 tokenizer estimate). The subsets are as follows:

1. `c4_realnews`: The RealNews domain subset of the C4 dataset, containing news articles.
2. `openwebtext`: The OpenWebText dataset, containing the contents of the links mentioned in Reddit posts with at least 3 upvotes.
3. `peS2o`: The peS2o dataset, containing academic articles from Semantic Scholar.
4. `redpajama_books`: The books subset of RedPajama V1.
5. `stackexchange`: The EN StackExchange non-code subset of the BigScience ROOTS dataset.
6. `uspto`: The EN USPTO patent application contents subset of the BigScience ROOTS dataset.
7. `wiki`: The EN Wiki subset of the BigScience ROOTS dataset.

The following processing and filtering steps have been applied:

1. Removed citation text and bibliography information from academic texts.
2. Ran a perplexity filter using a KenLM model trained on the EN OSCAR corpus and removed documents with a perplexity of more than 325 or less than 7.
3. Removed samples which have a repeating <=4-gram proportion of 15%.
4. Removed samples with lower than 99% confidence of being EN according to the lingua language detector.
5. Performed an aggressive MinHash de-dupe using a shingle size of 8 and a low threshold of 0.5.
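
Filtering steps 2, 3, and 5 can be illustrated with a minimal pure-Python sketch. This is not the author's pipeline: the card does not say how tokens are split, how the "<=4-gram" proportion is aggregated, whether the bounds are inclusive, or how many hash functions were used. Everything beyond the four numeric thresholds (word-level tokens, 4-grams only, `NUM_PERM = 128`, the helper names) is an assumption made for illustration. Perplexity is taken as an input because KenLM scoring is outside the scope of the sketch.

```python
import hashlib
import re
from collections import Counter

# Thresholds stated in the dataset card.
PPL_MIN, PPL_MAX = 7.0, 325.0
REPEAT_MAX = 0.15
SHINGLE_SIZE = 8
DEDUPE_THRESHOLD = 0.5
# Assumption: number of hash functions per signature (not given in the card).
NUM_PERM = 128


def tokens(text):
    """Assumption: lowercase word-level tokenization."""
    return re.findall(r"\w+", text.lower())


def keep_by_perplexity(ppl):
    """Keep a document whose perplexity lies in the band (bounds assumed inclusive)."""
    return PPL_MIN <= ppl <= PPL_MAX


def repeated_ngram_fraction(text, n=4):
    """Fraction of n-gram occurrences whose n-gram appears more than once."""
    toks = tokens(text)
    grams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c for c in counts.values() if c > 1) / len(grams)


def keep_by_repetition(text):
    """Assumption: only 4-grams are checked, and 15% itself is rejected."""
    return repeated_ngram_fraction(text, n=4) < REPEAT_MAX


def shingles(text, k=SHINGLE_SIZE):
    """Set of overlapping k-word shingles (short texts yield one shingle)."""
    toks = tokens(text)
    return {" ".join(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}


def minhash(text, num_perm=NUM_PERM):
    """MinHash signature: the minimum of each salted 64-bit hash over the shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in encoded
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Share of matching signature slots, which estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def is_near_duplicate(text_a, text_b):
    return estimated_jaccard(minhash(text_a), minhash(text_b)) >= DEDUPE_THRESHOLD
```

Comparing every pair of documents this way is quadratic, so a corpus of this size would normally bucket signatures with locality-sensitive hashing first. The pairwise form is kept here only to show what the shingle size and threshold control.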
Provider:
iNeil77
Summary of original information

Dataset overview

Dataset configurations

1. all

  • Features:
    • content: type string
  • Splits:
    • train: 56,194,997 examples, 360,187,653,412.6177 bytes
  • Download size: 199,030,076,349 bytes
  • Dataset size: 360,187,653,412.6177 bytes

2. c4_realnews

  • Features:
    • content: type string
  • Splits:
    • train: 11,427,438 examples, 31,597,106,256.723488 bytes
  • Download size: 19,889,880,484 bytes
  • Dataset size: 31,597,106,256.723488 bytes

3. openwebtext

  • Features:
    • content: type string
  • Splits:
    • train: 6,474,479 examples, 30,974,178,275.039234 bytes
  • Download size: 19,069,709,415 bytes
  • Dataset size: 30,974,178,275.039234 bytes

4. peS2o

  • Features:
    • content: type string
  • Splits:
    • train: 32,612,199 examples, 221,900,508,006.5479 bytes
  • Download size: 116,217,303,065 bytes
  • Dataset size: 221,900,508,006.5479 bytes

5. redpajama_books

  • Features:
    • content: type string
  • Splits:
    • train: 107,443 examples, 49,246,538,575.26426 bytes
  • Download size: 29,612,204,926 bytes
  • Dataset size: 49,246,538,575.26426 bytes

6. stackexchange

  • Features:
    • content: type string
  • Splits:
    • train: 716,532 examples, 2,034,535,930.2150385 bytes
  • Download size: 1,222,605,537 bytes
  • Dataset size: 2,034,535,930.2150385 bytes

7. uspto

  • Features:
    • content: type string
  • Splits:
    • train: 3,247,716 examples, 14,755,999,149.910166 bytes
  • Download size: 7,058,272,149 bytes
  • Dataset size: 14,755,999,149.910166 bytes

8. wiki

  • Features:
    • content: type string
  • Splits:
    • train: 1,609,190 examples, 7,528,525,537.163156 bytes
  • Download size: 4,593,971,902 bytes
  • Dataset size: 7,528,525,537.163156 bytes
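
As a quick consistency check on the figures listed for the eight configurations, the short script below sums the per-subset example counts and compares the result with the count reported for `all`. The numbers are copied from the listing; the variable names are illustrative only.

```python
# Example counts of the seven source subsets, as listed in the card metadata.
NUM_EXAMPLES = {
    "c4_realnews": 11_427_438,
    "openwebtext": 6_474_479,
    "peS2o": 32_612_199,
    "redpajama_books": 107_443,
    "stackexchange": 716_532,
    "uspto": 3_247_716,
    "wiki": 1_609_190,
}
# Example count listed for the combined `all` configuration.
ALL_EXAMPLES = 56_194_997

total = sum(NUM_EXAMPLES.values())
print(f"sum of subsets: {total:,}  matches `all`: {total == ALL_EXAMPLES}")

# Share of documents contributed by each subset, largest first.
for name, count in sorted(NUM_EXAMPLES.items(), key=lambda item: -item[1]):
    print(f"{name:16s} {count:>12,d} {count / total:6.1%}")
```

The counts add up exactly to the `all` figure, and `peS2o` alone contributes more than half of the documents, so a model trained on `all` without reweighting will see mostly academic text.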

Data file paths

  • all: all/train-*
  • c4_realnews: c4_realnews/train-*
  • openwebtext: openwebtext/train-*
  • peS2o: peS2o/train-*
  • redpajama_books: redpajama_books/train-*
  • stackexchange: stackexchange/train-*
  • uspto: uspto/train-*
  • wiki: wiki/train-*
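
Each path above corresponds to one configuration name, which is what the Hugging Face `datasets` library expects as the second argument to `load_dataset`. The sketch below assumes that library is installed and that the repository ID is `iNeil77/pseudo-mini-pile`, as in the page title. `load_subset` and `preview` are hypothetical helper names, and streaming is chosen because the `all` configuration is roughly 199 GB to download.

```python
from itertools import islice

REPO_ID = "iNeil77/pseudo-mini-pile"
CONFIGS = [
    "all", "c4_realnews", "openwebtext", "peS2o",
    "redpajama_books", "stackexchange", "uspto", "wiki",
]


def load_subset(config, streaming=True):
    """Open the train split of one configuration (every config has only `train`)."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the helpers can be defined without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config, split="train", streaming=streaming)


def preview(config="wiki", n=3, width=200):
    """Print the first `width` characters of the first `n` documents."""
    for row in islice(load_subset(config), n):
        # `content` is the single string feature in every configuration.
        print(row["content"][:width])
```

Calling `preview("stackexchange")` needs network access. With `streaming=True` it reads only the first records instead of fetching the full subset.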

Task categories

  • Text generation

Language

  • English

Size category

  • 10M<n<100M