BEE-spoke-data/Long-Data-Col-rp_pile_pretrain
Source: Hugging Face · Updated: 2023-10-26 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/BEE-spoke-data/Long-Data-Col-rp_pile_pretrain
Resource description:
---
license: other
size_categories:
- 1M<n<10M
source_datasets: togethercomputer/Long-Data-Collections
task_categories:
- text-generation
- fill-mask
- feature-extraction
configs:
- config_name: cleaned
  data_files:
  - split: train
    path: cleaned/train-*
- config_name: cleaned-dedup
  data_files:
  - split: train
    path: cleaned-dedup/train-*
- config_name: cleaned-dedup-en
  data_files:
  - split: train
    path: cleaned-dedup-en/train-*
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
- config_name: cleaned
  features:
  - name: text
    dtype: string
  - name: meta
    dtype: string
  splits:
  - name: train
    num_bytes: 16969436991
    num_examples: 2759555
  download_size: 9521997027
  dataset_size: 16969436991
- config_name: cleaned-dedup
  features:
  - name: text
    dtype: string
  - name: meta
    dtype: string
  splits:
  - name: train
    num_bytes: 13009681081
    num_examples: 2712907
  download_size: 7319241627
  dataset_size: 13009681081
- config_name: cleaned-dedup-en
  features:
  - name: text
    dtype: string
  - name: meta
    dtype: string
  splits:
  - name: train
    num_bytes: 12723856310.202166
    num_examples: 2653304
  download_size: 7180653999
  dataset_size: 12723856310.202166
- config_name: default
  features:
  - name: text
    dtype: string
  - name: meta
    dtype: string
  splits:
  - name: train
    num_bytes: 16821991568.354612
    num_examples: 2759555
  download_size: 9685120636
  dataset_size: 16821991568.354612
tags:
- long boi
---
# Dataset Card for "Long-Data-Col-rp_pile_pretrain"
This dataset is a subset of [togethercomputer/Long-Data-Collections](https://huggingface.co/datasets/togethercomputer/Long-Data-Collections), namely the `rp_sub.jsonl.zst` and `pile_sub.jsonl.zst` files from the `pretrain` split.
Like the source dataset, we do not attempt to modify/change licenses of underlying data. Refer to the source dataset (and its source datasets) for details.
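The card metadata declares four configs (`default`, `cleaned`, `cleaned-dedup`, `cleaned-dedup-en`), each with a single `train` split. A minimal sketch of selecting one with the Hugging Face `datasets` library follows. The helper name `load_config` and the `NUM_EXAMPLES` table are illustrative and not part of the dataset; the row counts are copied from the card metadata. The card does not document how each config was produced, so the differences computed at the end are plain arithmetic on the published counts, not a description of the processing.

```python
DATASET_ID = "BEE-spoke-data/Long-Data-Col-rp_pile_pretrain"

# Row counts per config, copied from the card's `dataset_info` metadata.
NUM_EXAMPLES = {
    "default": 2_759_555,
    "cleaned": 2_759_555,
    "cleaned-dedup": 2_712_907,
    "cleaned-dedup-en": 2_653_304,
}


def load_config(name="default", streaming=True):
    """Load the `train` split of one config (hypothetical helper).

    Streaming is the default here because each config is several GB
    to download; pass streaming=False to materialize it locally.
    """
    if name not in NUM_EXAMPLES:
        raise ValueError(f"unknown config {name!r}; expected one of {sorted(NUM_EXAMPLES)}")
    # Imported lazily so the lookup table above is usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, name, split="train", streaming=streaming)


# Differences between the published row counts of successive configs.
diff_cleaned_vs_dedup = NUM_EXAMPLES["cleaned"] - NUM_EXAMPLES["cleaned-dedup"]
diff_dedup_vs_en = NUM_EXAMPLES["cleaned-dedup"] - NUM_EXAMPLES["cleaned-dedup-en"]
```

For example, `next(iter(load_config("cleaned-dedup-en")))` would yield one record with the `text` and `meta` string fields, assuming network access to the Hub.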
## Changes
1. As this is meant to be a "long text dataset", we drop all rows where `text` has 250 characters or fewer. This removes approximately 100k rows from the raw data. Summary statistics of `text` length (in characters) after filtering are below.
| | text_len |
|:------|----------------:|
| count | 2.75956e+06 |
| mean | 6195.11 |
| std | 56364.9 |
| min | 251 |
| 25% | 1102 |
| 50% | 2147 |
| 75% | 4762 |
| max | 4.66452e+07 |
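The length filter described above can be sketched in a few lines. This is an illustration on toy rows using only the standard library, not the script that produced the dataset; the names `drop_short_rows`, `text_len_summary`, and `MIN_CHARS` are hypothetical, and only the 250-character threshold comes from the card.

```python
from statistics import mean, median

MIN_CHARS = 250  # threshold stated in the card: rows with <= 250 characters are dropped


def drop_short_rows(rows, min_chars=MIN_CHARS):
    """Keep only rows whose `text` is strictly longer than `min_chars` characters."""
    return [row for row in rows if len(row["text"]) > min_chars]


def text_len_summary(rows):
    """Summarize `text` length in characters for the given rows."""
    lengths = [len(row["text"]) for row in rows]
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "min": min(lengths),
        "50%": median(lengths),
        "max": max(lengths),
    }


# Toy rows with the same two string fields as the dataset.
toy_rows = [
    {"text": "a" * 250, "meta": "{}"},   # exactly 250 characters: dropped
    {"text": "b" * 251, "meta": "{}"},   # 251 characters: kept
    {"text": "c" * 1000, "meta": "{}"},  # kept
]

kept = drop_short_rows(toy_rows)
summary = text_len_summary(kept)
```

Because the comparison is strict, the shortest surviving row has 251 characters, which matches the `min` value reported in the table.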
---
Provider:
BEE-spoke-data



