BEE-spoke-data/Long-Data-Col-rp_pile_pretrain

Name: BEE-spoke-data/Long-Data-Col-rp_pile_pretrain
Creator: BEE-spoke-data
Published: 2023-10-26 02:01:57
License: No description available

Source: Hugging Face. Updated: 2023-10-26. Indexed: 2024-03-04.

Download link:

https://hf-mirror.com/datasets/BEE-spoke-data/Long-Data-Col-rp_pile_pretrain

Resource description:

---
license: other
size_categories:
- 1M<n<10M
source_datasets: togethercomputer/Long-Data-Collections
task_categories:
- text-generation
- fill-mask
- feature-extraction
configs:
- config_name: cleaned
  data_files:
  - split: train
    path: cleaned/train-*
- config_name: cleaned-dedup
  data_files:
  - split: train
    path: cleaned-dedup/train-*
- config_name: cleaned-dedup-en
  data_files:
  - split: train
    path: cleaned-dedup-en/train-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
- config_name: cleaned
  features:
  - name: text
    dtype: string
  - name: meta
    dtype: string
  splits:
  - name: train
    num_bytes: 16969436991
    num_examples: 2759555
  download_size: 9521997027
  dataset_size: 16969436991
- config_name: cleaned-dedup
  features:
  - name: text
    dtype: string
  - name: meta
    dtype: string
  splits:
  - name: train
    num_bytes: 13009681081
    num_examples: 2712907
  download_size: 7319241627
  dataset_size: 13009681081
- config_name: cleaned-dedup-en
  features:
  - name: text
    dtype: string
  - name: meta
    dtype: string
  splits:
  - name: train
    num_bytes: 12723856310.202166
    num_examples: 2653304
  download_size: 7180653999
  dataset_size: 12723856310.202166
- config_name: default
  features:
  - name: text
    dtype: string
  - name: meta
    dtype: string
  splits:
  - name: train
    num_bytes: 16821991568.354612
    num_examples: 2759555
  download_size: 9685120636
  dataset_size: 16821991568.354612
tags:
- long boi
---

# Dataset Card for "Long-Data-Col-rp_pile_pretrain"

This dataset is a subset of [togethercomputer/Long-Data-Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections), namely the `rp_sub.jsonl.zst` and `pile_sub.jsonl.zst` files from the `pretrain` split.

Like the source dataset, we do not attempt to modify/change licenses of underlying data. Refer to the source dataset (and its source datasets) for details.

## changes

1. as this is supposed to be a "long text dataset", we drop all rows where `text` contains <= 250 characters. This drops approx 100k rows from the raw data. Resulting stats are below.

|       |        text_len |
|:------|----------------:|
| count |     2.75956e+06 |
| mean  |         6195.11 |
| std   |         56364.9 |
| min   |             251 |
| 25%   |            1102 |
| 50%   |            2147 |
| 75%   |            4762 |
| max   |     4.66452e+07 |
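The four configs declared in the card's YAML header can be selected by name when loading. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is reachable; the helper name `stream_config` is hypothetical and not part of the dataset card. Streaming mode is used here so that nothing is downloaded up front.

```python
# Repository ID and config names are taken from the dataset card.
REPO_ID = "BEE-spoke-data/Long-Data-Col-rp_pile_pretrain"
CONFIGS = ("default", "cleaned", "cleaned-dedup", "cleaned-dedup-en")


def stream_config(config_name="default"):
    """Return a streaming view of the `train` split for one config.

    `stream_config` is an illustrative helper, not something the card defines.
    """
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config {config_name!r}; expected one of {CONFIGS}")
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=True)
```

Calling `stream_config("cleaned-dedup-en")` and taking `next(iter(...))` on the result should yield one record with the two string fields listed in the card, `text` and `meta`.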

Provider:

BEE-spoke-data

Summary of original information

Dataset overview

Basic dataset information

License: other
Size category: 1M<n<10M
Source dataset: togethercomputer/Long-Data-Collections
Task categories:
- Text generation
- Fill-mask
- Feature extraction

Configuration

config_name: cleaned
- Data files:
  - Split: train
  - Path: cleaned/train-*
- Features:
  - Name: text
    - Data type: string
  - Name: meta
    - Data type: string
- Splits:
  - Name: train
    - Bytes: 16969436991
    - Examples: 2759555
- Download size: 9521997027 bytes
- Dataset size: 16969436991 bytes
config_name: cleaned-dedup
- Data files:
  - Split: train
  - Path: cleaned-dedup/train-*
- Features:
  - Name: text
    - Data type: string
  - Name: meta
    - Data type: string
- Splits:
  - Name: train
    - Bytes: 13009681081
    - Examples: 2712907
- Download size: 7319241627 bytes
- Dataset size: 13009681081 bytes
config_name: cleaned-dedup-en
- Data files:
  - Split: train
  - Path: cleaned-dedup-en/train-*
- Features:
  - Name: text
    - Data type: string
  - Name: meta
    - Data type: string
- Splits:
  - Name: train
    - Bytes: 12723856310.202166
    - Examples: 2653304
- Download size: 7180653999 bytes
- Dataset size: 12723856310.202166 bytes
config_name: default
- Data files:
  - Split: train
  - Path: data/train-*
- Features:
  - Name: text
    - Data type: string
  - Name: meta
    - Data type: string
- Splits:
  - Name: train
    - Bytes: 16821991568.354612
    - Examples: 2759555
- Download size: 9685120636 bytes
- Dataset size: 16821991568.354612 bytes
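The row counts and download sizes above make it easy to compare the configs. The sketch below copies those figures into a small table and computes row differences and download sizes in GiB. It assumes, from the config names alone, that `cleaned-dedup` is derived from `cleaned` and `cleaned-dedup-en` from `cleaned-dedup`; the source does not state this explicitly. The names `CONFIG_STATS`, `rows_removed`, and `download_gib` are illustrative.

```python
# Row counts and download sizes (bytes) copied from the configuration summary.
CONFIG_STATS = {
    "default": {"num_examples": 2759555, "download_size": 9685120636},
    "cleaned": {"num_examples": 2759555, "download_size": 9521997027},
    "cleaned-dedup": {"num_examples": 2712907, "download_size": 7319241627},
    "cleaned-dedup-en": {"num_examples": 2653304, "download_size": 7180653999},
}


def rows_removed(before, after):
    """Difference in row count between two configs."""
    return CONFIG_STATS[before]["num_examples"] - CONFIG_STATS[after]["num_examples"]


def download_gib(config_name):
    """Download size of a config in GiB (2**30 bytes)."""
    return CONFIG_STATS[config_name]["download_size"] / 2**30


for name, stats in CONFIG_STATS.items():
    print(f"{name}: {stats['num_examples']:,} rows, {download_gib(name):.2f} GiB download")

# Assumed pipeline order (inferred from the names): cleaned -> dedup -> dedup-en.
print("removed by dedup:", rows_removed("cleaned", "cleaned-dedup"))
print("removed by en filter:", rows_removed("cleaned-dedup", "cleaned-dedup-en"))
```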

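Dataset processing

Processing note: The dataset is intended to contain long texts, so all rows whose `text` field contains 250 characters or fewer were dropped. This removed approximately 100k rows from the raw data.
Statistics after processing (`text_len`, in characters):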
- count: 2.75956e+06
- mean: 6195.11
- std: 56364.9
- min: 251
- 25%: 1102
- 50%: 2147
- 75%: 4762
- max: 4.66452e+07
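The length filter described above can be illustrated on in-memory rows. This is a sketch of the stated rule (drop rows whose `text` has 250 characters or fewer), not the authors' actual processing code; `drop_short_rows` and `text_len_summary` are hypothetical helper names, and the summary covers only a few of the statistics listed.

```python
MIN_CHARS = 250  # rows with len(text) <= 250 are dropped, per the processing note


def drop_short_rows(rows, min_chars=MIN_CHARS):
    """Keep only rows whose `text` is strictly longer than `min_chars`."""
    return [row for row in rows if len(row["text"]) > min_chars]


def text_len_summary(rows):
    """Count, min, max, and mean of `text` length over a non-empty list of rows."""
    lengths = [len(row["text"]) for row in rows]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


# Toy rows with the same two string fields as the dataset.
sample = [
    {"text": "a" * 250, "meta": "{}"},   # exactly 250 characters: dropped
    {"text": "b" * 251, "meta": "{}"},   # 251 characters: kept
    {"text": "c" * 1000, "meta": "{}"},  # kept
]
kept = drop_short_rows(sample)
summary = text_len_summary(kept)
```

The boundary case matters here: a row of exactly 250 characters is removed, which is consistent with the reported minimum of 251.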
