spachava/openwebtext
Source: Hugging Face | Updated: 2023-11-13 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/spachava/openwebtext
Resource description:
---
license: mit
dataset_info:
- config_name: gpt2-1024
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  - name: labels
    sequence: int64
  splits:
  - name: train
    num_bytes: 111601610816
    num_examples: 8375984
  - name: validation
    num_bytes: 5911299192
    num_examples: 443658
  download_size: 34879494010
  dataset_size: 117512910008
- config_name: gpt2-tokenized
  features:
  - name: input_ids
    sequence: int32
  - name: attention_mask
    sequence: int8
  splits:
  - name: train
    num_bytes: 42949162328
    num_examples: 7613081
  - name: validation
    num_bytes: 2274964454
    num_examples: 400688
  download_size: 17282982051
  dataset_size: 45224126782
configs:
- config_name: gpt2-1024
  data_files:
  - split: train
    path: gpt2-1024/train-*
  - split: validation
    path: gpt2-1024/validation-*
- config_name: gpt2-tokenized
  data_files:
  - split: train
    path: gpt2-tokenized/train-*
  - split: validation
    path: gpt2-tokenized/validation-*
---
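The metadata above declares two configurations, each with a train and a validation split. A minimal sketch of opening one of them follows, assuming the Hugging Face `datasets` library is installed and the repository is still reachable under the ID in the title. `load_split` is a hypothetical helper, not part of the dataset or the library. Streaming is the default here because the decoded train splits are large (over 100 GB for gpt2-1024).

```python
REPO_ID = "spachava/openwebtext"

# Configuration and split names as declared in the metadata above.
CONFIGS = {
    "gpt2-1024": ("train", "validation"),
    "gpt2-tokenized": ("train", "validation"),
}


def load_split(config: str, split: str, streaming: bool = True):
    """Hypothetical helper: open one split of one configuration."""
    if config not in CONFIGS:
        raise ValueError(f"unknown config {config!r}; expected one of {sorted(CONFIGS)}")
    if split not in CONFIGS[config]:
        raise ValueError(f"unknown split {split!r} for config {config!r}")
    # Imported lazily so the name checks above work without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config, split=split, streaming=streaming)


# Example use (needs network access):
#   ds = load_split("gpt2-1024", "validation")
#   first = next(iter(ds))
#   print(sorted(first.keys()))
```

With `streaming=True` the call returns an iterable dataset and reads shards on demand. Passing `streaming=False` would download the full configuration first.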
Provider:
spachava
Summary of original information
Dataset overview
License
- MIT
Dataset configurations
Configuration: gpt2-1024
- Features
  - input_ids: sequence of int32
  - attention_mask: sequence of int8
  - labels: sequence of int64
- Splits
  - train
    - Bytes: 111601610816
    - Examples: 8375984
  - validation
    - Bytes: 5911299192
    - Examples: 443658
- Download size: 34879494010 bytes
- Dataset size: 117512910008 bytes
- Data files
  - train: gpt2-1024/train-*
  - validation: gpt2-1024/validation-*
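The byte and example counts for gpt2-1024 can be used to check how large each record is. The sketch below assumes the "1024" in the configuration name is the sequence length in tokens; that reading is not stated in the source, but the arithmetic is consistent with it. Both splits divide evenly into 13324 bytes per example, which is 1024 tokens at 13 bytes each (int32 + int8 + int64) plus 12 bytes. The 12 extra bytes are presumably per-column bookkeeping such as list offsets, which is also an assumption.

```python
# Storage width in bytes for each declared dtype.
DTYPE_BYTES = {"int32": 4, "int8": 1, "int64": 8}
FEATURES = {"input_ids": "int32", "attention_mask": "int8", "labels": "int64"}

# Split sizes copied from the gpt2-1024 metadata.
SPLITS = {
    "train": {"num_bytes": 111601610816, "num_examples": 8375984},
    "validation": {"num_bytes": 5911299192, "num_examples": 443658},
}

SEQ_LEN = 1024  # assumption: taken from the configuration name

bytes_per_token = sum(DTYPE_BYTES[d] for d in FEATURES.values())
payload = SEQ_LEN * bytes_per_token  # token data only, no bookkeeping

results = {}
for name, s in SPLITS.items():
    per_example, remainder = divmod(s["num_bytes"], s["num_examples"])
    results[name] = {
        "per_example": per_example,
        "remainder": remainder,
        "overhead": per_example - payload,
    }
    print(name, results[name])
# train {'per_example': 13324, 'remainder': 0, 'overhead': 12}
# validation {'per_example': 13324, 'remainder': 0, 'overhead': 12}
```

A zero remainder in both splits suggests every example has the same length, so batches from this configuration should need no padding.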
Configuration: gpt2-tokenized
- Features
  - input_ids: sequence of int32
  - attention_mask: sequence of int8
- Splits
  - train
    - Bytes: 42949162328
    - Examples: 7613081
  - validation
    - Bytes: 2274964454
    - Examples: 400688
- Download size: 17282982051 bytes
- Dataset size: 45224126782 bytes
- Data files
  - train: gpt2-tokenized/train-*
  - validation: gpt2-tokenized/validation-*
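As a sanity check on the figures listed for both configurations, the following sketch confirms that the split sizes add up to the reported dataset size and estimates how much larger the decoded data is than the download. The `summarize` helper and the dictionary layout are illustrative, not part of any library. The ratio comes out at about 3.4 for gpt2-1024 and about 2.6 for gpt2-tokenized.

```python
# Sizes in bytes, copied from the metadata for each configuration.
CONFIG_SIZES = {
    "gpt2-1024": {
        "splits": {"train": 111601610816, "validation": 5911299192},
        "download_size": 34879494010,
        "dataset_size": 117512910008,
    },
    "gpt2-tokenized": {
        "splits": {"train": 42949162328, "validation": 2274964454},
        "download_size": 17282982051,
        "dataset_size": 45224126782,
    },
}


def summarize(cfg: dict) -> dict:
    """Check that splits sum to dataset_size and compute the size ratio."""
    total = sum(cfg["splits"].values())
    return {
        "splits_match_total": total == cfg["dataset_size"],
        "download_gb": cfg["download_size"] / 1e9,
        "dataset_gb": cfg["dataset_size"] / 1e9,
        "expansion_ratio": cfg["dataset_size"] / cfg["download_size"],
    }


summaries = {name: summarize(cfg) for name, cfg in CONFIG_SIZES.items()}
for name, s in summaries.items():
    print(
        f"{name}: download {s['download_gb']:.1f} GB, "
        f"decoded {s['dataset_gb']:.1f} GB, "
        f"ratio {s['expansion_ratio']:.2f}, "
        f"consistent={s['splits_match_total']}"
    )
```

The decoded size is a useful guide to the disk space needed when loading without streaming, since the download size alone understates it.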



