orionweller/dolma_18bn_stratified_sample
Hugging Face · Updated 2024-06-14 · Indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/orionweller/dolma_18bn_stratified_sample
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: added
    dtype: string
  - name: created
    dtype: string
  - name: source
    dtype: string
  - name: original_shard_dir
    dtype: string
  - name: original_shard_idx
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: shard_0
    num_bytes: 10048491187
    num_examples: 3080525
  - name: shard_1
    num_bytes: 10047334747
    num_examples: 2746799
  - name: shard_2
    num_bytes: 10020442845
    num_examples: 2725694
  - name: shard_3
    num_bytes: 10035011475
    num_examples: 2846397
  - name: shard_4
    num_bytes: 10052333787
    num_examples: 2900288
  - name: shard_5
    num_bytes: 10013214029
    num_examples: 3707634
  - name: shard_6
    num_bytes: 10005928795
    num_examples: 3365483
  - name: shard_7
    num_bytes: 10006714170
    num_examples: 3241452
  - name: shard_8
    num_bytes: 10027646826
    num_examples: 3158601
  - name: shard_9
    num_bytes: 10004924092
    num_examples: 3869772
  - name: shard_10
    num_bytes: 10212192755
    num_examples: 4096444
  - name: shard_11
    num_bytes: 3663865968
    num_examples: 1424007
  download_size: 66080652508
  dataset_size: 114138100676
configs:
- config_name: default
  data_files:
  - split: shard_0
    path: data/shard_0-*
  - split: shard_1
    path: data/shard_1-*
  - split: shard_2
    path: data/shard_2-*
  - split: shard_3
    path: data/shard_3-*
  - split: shard_4
    path: data/shard_4-*
  - split: shard_5
    path: data/shard_5-*
  - split: shard_6
    path: data/shard_6-*
  - split: shard_7
    path: data/shard_7-*
  - split: shard_8
    path: data/shard_8-*
  - split: shard_9
    path: data/shard_9-*
  - split: shard_10
    path: data/shard_10-*
  - split: shard_11
    path: data/shard_11-*
---
Provider:
orionweller
Summary of original information
Dataset overview
Dataset features
- id: string
- text: string
- added: string
- created: string
- source: string
- original_shard_dir: string
- original_shard_idx: integer (int64)
- num_tokens: integer (int64)
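The eight fields above can be written down as a typed record so that rows are checked before further processing. The following is a minimal sketch in plain Python, not part of the dataset's own tooling: the names `DolmaRecord`, `EXPECTED_TYPES`, and `validate_record` are hypothetical, and the sample row contains made-up placeholder values rather than real data.

```python
from typing import TypedDict


class DolmaRecord(TypedDict):
    """One row, mirroring the feature list in the dataset card."""
    id: str
    text: str
    added: str
    created: str
    source: str
    original_shard_dir: str
    original_shard_idx: int  # int64 in the card
    num_tokens: int          # int64 in the card


# Field name -> Python type expected for that field.
EXPECTED_TYPES = {
    "id": str,
    "text": str,
    "added": str,
    "created": str,
    "source": str,
    "original_shard_dir": str,
    "original_shard_idx": int,
    "num_tokens": int,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    return problems


# Placeholder row for illustration only; every value here is invented.
sample: DolmaRecord = {
    "id": "example-id",
    "text": "example text",
    "added": "example-added",
    "created": "example-created",
    "source": "example-source",
    "original_shard_dir": "example-dir",
    "original_shard_idx": 0,
    "num_tokens": 2,
}

print(validate_record(sample))
```

A check like this is cheap to run on the first few rows of each shard and catches schema drift early.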
Dataset shards
- shard_0:
  - Bytes: 10048491187
  - Examples: 3080525
- shard_1:
  - Bytes: 10047334747
  - Examples: 2746799
- shard_2:
  - Bytes: 10020442845
  - Examples: 2725694
- shard_3:
  - Bytes: 10035011475
  - Examples: 2846397
- shard_4:
  - Bytes: 10052333787
  - Examples: 2900288
- shard_5:
  - Bytes: 10013214029
  - Examples: 3707634
- shard_6:
  - Bytes: 10005928795
  - Examples: 3365483
- shard_7:
  - Bytes: 10006714170
  - Examples: 3241452
- shard_8:
  - Bytes: 10027646826
  - Examples: 3158601
- shard_9:
  - Bytes: 10004924092
  - Examples: 3869772
- shard_10:
  - Bytes: 10212192755
  - Examples: 4096444
- shard_11:
  - Bytes: 3663865968
  - Examples: 1424007
Dataset size
- Download size: 66080652508 bytes
- Dataset size: 114138100676 bytes
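The per-shard figures and the two totals above can be cross-checked with simple arithmetic. The sketch below copies the numbers from the dataset card and confirms that the shard byte counts sum to the reported dataset size; the variable names are illustrative, and reading the download-to-dataset ratio as on-disk compression is an assumption, not something the card states.

```python
# (num_bytes, num_examples) per shard, copied from the dataset card.
SHARDS = {
    "shard_0": (10048491187, 3080525),
    "shard_1": (10047334747, 2746799),
    "shard_2": (10020442845, 2725694),
    "shard_3": (10035011475, 2846397),
    "shard_4": (10052333787, 2900288),
    "shard_5": (10013214029, 3707634),
    "shard_6": (10005928795, 3365483),
    "shard_7": (10006714170, 3241452),
    "shard_8": (10027646826, 3158601),
    "shard_9": (10004924092, 3869772),
    "shard_10": (10212192755, 4096444),
    "shard_11": (3663865968, 1424007),
}
DOWNLOAD_SIZE = 66080652508   # bytes, from the card
DATASET_SIZE = 114138100676   # bytes, from the card

total_bytes = sum(num_bytes for num_bytes, _ in SHARDS.values())
total_examples = sum(num_examples for _, num_examples in SHARDS.values())

# Ratio of download size to in-memory dataset size (assumed to reflect compression).
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE
mean_bytes_per_example = total_bytes / total_examples

print(f"total bytes:     {total_bytes}")
print(f"matches card:    {total_bytes == DATASET_SIZE}")
print(f"total examples:  {total_examples}")
print(f"download ratio:  {download_ratio:.3f}")
print(f"bytes/example:   {mean_bytes_per_example:.0f}")
```

The shard sizes add up to exactly 114138100676 bytes across 37163096 examples, so the card's totals are internally consistent. Eleven shards hold roughly 10 GB each and the final shard, `shard_11`, holds the remainder.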
Configuration
- config_name: default
- data_files:
  - split: shard_0
    - path: data/shard_0-*
  - split: shard_1
    - path: data/shard_1-*
  - split: shard_2
    - path: data/shard_2-*
  - split: shard_3
    - path: data/shard_3-*
  - split: shard_4
    - path: data/shard_4-*
  - split: shard_5
    - path: data/shard_5-*
  - split: shard_6
    - path: data/shard_6-*
  - split: shard_7
    - path: data/shard_7-*
  - split: shard_8
    - path: data/shard_8-*
  - split: shard_9
    - path: data/shard_9-*
  - split: shard_10
    - path: data/shard_10-*
  - split: shard_11
    - path: data/shard_11-*
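Because each shard is exposed as its own split in the `default` config, a single shard can be read without downloading the others. The following is a hedged sketch that assumes the Hugging Face `datasets` library is installed and the repository is reachable; the helper names `shard_split_name` and `stream_shard` are hypothetical, and streaming mode is chosen here only to avoid fetching a full shard of about 10 GB.

```python
from itertools import islice

REPO_ID = "orionweller/dolma_18bn_stratified_sample"
NUM_SHARDS = 12  # shard_0 through shard_11, per the config


def shard_split_name(index: int) -> str:
    """Map a shard index to the split name used in the default config."""
    if not 0 <= index < NUM_SHARDS:
        raise ValueError(f"shard index must be in [0, {NUM_SHARDS - 1}], got {index}")
    return f"shard_{index}"


def stream_shard(index: int, limit: int = 3):
    """Yield up to `limit` rows from one shard without a full download."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    rows = load_dataset(REPO_ID, split=shard_split_name(index), streaming=True)
    yield from islice(rows, limit)


if __name__ == "__main__":
    # shard_11 is the smallest split, so it is a reasonable first one to inspect.
    for row in stream_shard(11, limit=2):
        print(row["id"], row["source"], row["num_tokens"])
```

Dropping `streaming=True` would instead download and cache the requested split locally, which suits repeated passes over the same shard.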



