orionweller/dolma_20bn_prop_stratified_sample
Source: Hugging Face. Updated 2024-06-18. Indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/orionweller/dolma_20bn_prop_stratified_sample
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: added
    dtype: string
  - name: created
    dtype: string
  - name: source
    dtype: string
  - name: original_shard_dir
    dtype: string
  - name: original_shard_idx
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: shard_0
    num_bytes: 10004058054
    num_examples: 2891134
  - name: shard_1
    num_bytes: 10115014699
    num_examples: 3107821
  - name: shard_2
    num_bytes: 10003965839
    num_examples: 2767485
  - name: shard_3
    num_bytes: 10105104758
    num_examples: 2780370
  - name: shard_4
    num_bytes: 10076669237
    num_examples: 2703160
  - name: shard_5
    num_bytes: 10070453626
    num_examples: 2801525
  - name: shard_6
    num_bytes: 10026306308
    num_examples: 2894785
  - name: shard_7
    num_bytes: 10055888005
    num_examples: 2961433
  - name: shard_8
    num_bytes: 10045373058
    num_examples: 3773801
  download_size: 54067885359
  dataset_size: 90502833584
configs:
- config_name: default
  data_files:
  - split: shard_0
    path: data/shard_0-*
  - split: shard_1
    path: data/shard_1-*
  - split: shard_2
    path: data/shard_2-*
  - split: shard_3
    path: data/shard_3-*
  - split: shard_4
    path: data/shard_4-*
  - split: shard_5
    path: data/shard_5-*
  - split: shard_6
    path: data/shard_6-*
  - split: shard_7
    path: data/shard_7-*
  - split: shard_8
    path: data/shard_8-*
---
Provider:
orionweller
Summary of original information
Dataset overview
Dataset features
- id: string
- text: string
- added: string
- created: string
- source: string
- original_shard_dir: string
- original_shard_idx: integer (int64)
- num_tokens: integer (int64)
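The feature list above can be mirrored as a small schema check for downstream code. This is a minimal sketch, assuming rows arrive as plain Python dictionaries; the names `EXPECTED_TYPES` and `validate_record` and every value in `example` are hypothetical and do not come from the dataset.

```python
from typing import Any, Dict, List

# Field names and dtypes copied from the feature list: string -> str, int64 -> int.
EXPECTED_TYPES = {
    "id": str,
    "text": str,
    "added": str,
    "created": str,
    "source": str,
    "original_shard_dir": str,
    "original_shard_idx": int,
    "num_tokens": int,
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Hypothetical row for illustration only; all values are invented.
example = {
    "id": "doc-0",
    "text": "Hello world.",
    "added": "2024-01-01",
    "created": "2023-12-31",
    "source": "example-source",
    "original_shard_dir": "example-dir",
    "original_shard_idx": 0,
    "num_tokens": 3,
}

print(validate_record(example))
```

A check like this is cheap to run on the first few rows of a split before starting a long preprocessing job.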
Dataset splits
- shard_0:
  - Bytes: 10004058054
  - Examples: 2891134
- shard_1:
  - Bytes: 10115014699
  - Examples: 3107821
- shard_2:
  - Bytes: 10003965839
  - Examples: 2767485
- shard_3:
  - Bytes: 10105104758
  - Examples: 2780370
- shard_4:
  - Bytes: 10076669237
  - Examples: 2703160
- shard_5:
  - Bytes: 10070453626
  - Examples: 2801525
- shard_6:
  - Bytes: 10026306308
  - Examples: 2894785
- shard_7:
  - Bytes: 10055888005
  - Examples: 2961433
- shard_8:
  - Bytes: 10045373058
  - Examples: 3773801
Dataset size
- Download size: 54067885359 bytes
- Dataset size: 90502833584 bytes
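The per-split figures can be cross-checked against the totals reported here. The sketch below uses only the numbers listed on this page; the variable names are illustrative. It confirms that the split byte counts sum to the reported dataset size, totals the examples, and computes the download-to-dataset size ratio.

```python
# (num_bytes, num_examples) per split, copied from the split listing.
SHARDS = {
    "shard_0": (10004058054, 2891134),
    "shard_1": (10115014699, 3107821),
    "shard_2": (10003965839, 2767485),
    "shard_3": (10105104758, 2780370),
    "shard_4": (10076669237, 2703160),
    "shard_5": (10070453626, 2801525),
    "shard_6": (10026306308, 2894785),
    "shard_7": (10055888005, 2961433),
    "shard_8": (10045373058, 3773801),
}
DOWNLOAD_SIZE = 54067885359  # bytes, as reported
DATASET_SIZE = 90502833584   # bytes, as reported

total_bytes = sum(num_bytes for num_bytes, _ in SHARDS.values())
total_examples = sum(num_examples for _, num_examples in SHARDS.values())
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"sum of split bytes matches dataset size: {total_bytes == DATASET_SIZE}")
print(f"total examples: {total_examples}")
print(f"download size / dataset size: {download_ratio:.3f}")
print(f"dataset size: {DATASET_SIZE / 1e9:.1f} GB, download: {DOWNLOAD_SIZE / 1e9:.1f} GB")
```

The download is about 60% of the in-memory dataset size, so plan disk space for both figures if the data will be fully materialized locally.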
Configuration
- config_name: default
- data_files:
  - split: shard_0, path: data/shard_0-*
  - split: shard_1, path: data/shard_1-*
  - split: shard_2, path: data/shard_2-*
  - split: shard_3, path: data/shard_3-*
  - split: shard_4, path: data/shard_4-*
  - split: shard_5, path: data/shard_5-*
  - split: shard_6, path: data/shard_6-*
  - split: shard_7, path: data/shard_7-*
  - split: shard_8, path: data/shard_8-*
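Each split in this configuration maps to a glob pattern under `data/`. The following sketch shows how a split index relates to its pattern and how one split might be previewed with the Hugging Face `datasets` library in streaming mode. The helper names (`split_name`, `data_file_pattern`, `preview_split`) and the sample file name are hypothetical; only the repository ID, split names, and patterns come from this page.

```python
from fnmatch import fnmatch
from itertools import islice

REPO_ID = "orionweller/dolma_20bn_prop_stratified_sample"
NUM_SHARDS = 9  # shard_0 .. shard_8, as listed in the configuration


def split_name(idx: int) -> str:
    """Return the split name for a shard index, e.g. 0 -> 'shard_0'."""
    if not 0 <= idx < NUM_SHARDS:
        raise ValueError(f"shard index must be in [0, {NUM_SHARDS}), got {idx}")
    return f"shard_{idx}"


def data_file_pattern(idx: int) -> str:
    """Return the glob pattern the default config uses for this split."""
    return f"data/{split_name(idx)}-*"


def preview_split(idx: int, n: int = 3) -> list:
    """Read the first n rows of one split lazily instead of fetching every file first."""
    # Third-party dependency, imported lazily so the helpers above work without it.
    from datasets import load_dataset

    stream = load_dataset(REPO_ID, split=split_name(idx), streaming=True)
    return list(islice(stream, n))


# Hypothetical file name, used only to illustrate how the glob pattern matches.
sample_file = "data/shard_0-00000-of-00020.parquet"
print(fnmatch(sample_file, data_file_pattern(0)))
print(fnmatch(sample_file, data_file_pattern(1)))

# Usage (needs network access and the `datasets` package):
#   rows = preview_split(0, n=3)
```

Streaming is a reasonable default here because each split is roughly 10 GB; dropping `streaming=True` would download and cache the split before returning.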



