answerdotai/dolma_20bn_stratified_sample
Hugging Face | Updated: 2024-05-29 | Indexed: 2024-06-12
Download link:
https://hf-mirror.com/datasets/answerdotai/dolma_20bn_stratified_sample
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: added
    dtype: string
  - name: created
    dtype: string
  - name: source
    dtype: string
  - name: original_shard_dir
    dtype: string
  - name: original_shard_idx
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: shard_0
    num_bytes: 10023532839
    num_examples: 3134276
  - name: shard_1
    num_bytes: 10085303199
    num_examples: 2719043
  - name: shard_2
    num_bytes: 10081626584
    num_examples: 2784699
  - name: shard_3
    num_bytes: 10033349618
    num_examples: 2716178
  - name: shard_4
    num_bytes: 10032658971
    num_examples: 2910157
  - name: shard_5
    num_bytes: 10000756575
    num_examples: 3289393
  - name: shard_6
    num_bytes: 10020183196
    num_examples: 3534969
  - name: shard_7
    num_bytes: 10030070745
    num_examples: 3447157
  - name: shard_8
    num_bytes: 10001543749
    num_examples: 3183587
  - name: shard_9
    num_bytes: 10036595084
    num_examples: 3155070
  - name: shard_10
    num_bytes: 10057228949
    num_examples: 3302254
  - name: shard_11
    num_bytes: 10038785525
    num_examples: 4914441
  - name: shard_12
    num_bytes: 6388567983
    num_examples: 2201242
  download_size: 73425935394
  dataset_size: 126830203017
configs:
- config_name: default
  data_files:
  - split: shard_0
    path: data/shard_0-*
  - split: shard_1
    path: data/shard_1-*
  - split: shard_2
    path: data/shard_2-*
  - split: shard_3
    path: data/shard_3-*
  - split: shard_4
    path: data/shard_4-*
  - split: shard_5
    path: data/shard_5-*
  - split: shard_6
    path: data/shard_6-*
  - split: shard_7
    path: data/shard_7-*
  - split: shard_8
    path: data/shard_8-*
  - split: shard_9
    path: data/shard_9-*
  - split: shard_10
    path: data/shard_10-*
  - split: shard_11
    path: data/shard_11-*
  - split: shard_12
    path: data/shard_12-*
---
Provider:
answerdotai
Summary of original information
Dataset overview
Dataset features
- id: string
- text: string
- added: string
- created: string
- source: string
- original_shard_dir: string
- original_shard_idx: integer (int64)
- num_tokens: integer (int64)
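The feature list above can be turned into a quick per-record check. The following is a minimal sketch, assuming records arrive as plain Python dictionaries; `EXPECTED_FIELDS` and `schema_problems` are hypothetical names introduced here for illustration, and the mapping of `string` to `str` and `int64` to `int` is an assumption about how the values are decoded.

```python
from typing import Any, Mapping

# Field names and dtypes copied from the feature list.
# Assumption: string decodes to str and int64 decodes to int.
EXPECTED_FIELDS = {
    "id": str,
    "text": str,
    "added": str,
    "created": str,
    "source": str,
    "original_shard_dir": str,
    "original_shard_idx": int,
    "num_tokens": int,
}


def schema_problems(record: Mapping[str, Any]) -> list[str]:
    """Return human-readable schema problems for one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    for name in record:
        if name not in EXPECTED_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems
```

A record that matches the feature list yields an empty list; anything else yields one message per mismatch, which makes the function easy to use as a filter or in a logging step.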
Dataset splits
- shard_0:
  - Size: 10023532839 bytes
  - Examples: 3134276
- shard_1:
  - Size: 10085303199 bytes
  - Examples: 2719043
- shard_2:
  - Size: 10081626584 bytes
  - Examples: 2784699
- shard_3:
  - Size: 10033349618 bytes
  - Examples: 2716178
- shard_4:
  - Size: 10032658971 bytes
  - Examples: 2910157
- shard_5:
  - Size: 10000756575 bytes
  - Examples: 3289393
- shard_6:
  - Size: 10020183196 bytes
  - Examples: 3534969
- shard_7:
  - Size: 10030070745 bytes
  - Examples: 3447157
- shard_8:
  - Size: 10001543749 bytes
  - Examples: 3183587
- shard_9:
  - Size: 10036595084 bytes
  - Examples: 3155070
- shard_10:
  - Size: 10057228949 bytes
  - Examples: 3302254
- shard_11:
  - Size: 10038785525 bytes
  - Examples: 4914441
- shard_12:
  - Size: 6388567983 bytes
  - Examples: 2201242
Dataset size
- Download size: 73425935394 bytes
- Total dataset size: 126830203017 bytes
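As a sanity check on the figures above, the per-split sizes should add up to the reported total dataset size. The sketch below copies the numbers from the split list; the names `SHARDS`, `total_bytes`, `total_examples`, and `download_ratio` are introduced here for illustration only. Reading the download size as the on-disk (compressed) size and the dataset size as the decoded size follows the usual Hugging Face convention and is an assumption about this particular card.

```python
# (num_bytes, num_examples) for shard_0 .. shard_12, copied from the split list.
SHARDS = [
    (10023532839, 3134276),
    (10085303199, 2719043),
    (10081626584, 2784699),
    (10033349618, 2716178),
    (10032658971, 2910157),
    (10000756575, 3289393),
    (10020183196, 3534969),
    (10030070745, 3447157),
    (10001543749, 3183587),
    (10036595084, 3155070),
    (10057228949, 3302254),
    (10038785525, 4914441),
    (6388567983, 2201242),
]

DOWNLOAD_SIZE = 73425935394   # bytes, as reported
DATASET_SIZE = 126830203017   # bytes, as reported

total_bytes = sum(num_bytes for num_bytes, _ in SHARDS)
total_examples = sum(num_examples for _, num_examples in SHARDS)
download_ratio = DOWNLOAD_SIZE / DATASET_SIZE

# The per-split byte counts sum exactly to the reported dataset size.
assert total_bytes == DATASET_SIZE

print(f"total examples: {total_examples:,}")
print(f"download / dataset size: {download_ratio:.3f}")
print(f"mean bytes per example: {total_bytes / total_examples:,.0f}")
```

The totals come to 41,292,466 examples across the 13 splits, and the download is a little under 60% of the total dataset size.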
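Configuration
- The default configuration (config_name: default) lists the data file paths for all splits; each split maps to one path pattern of the form data/shard_N-*.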
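Because each split is named after its shard, a single shard can be read without fetching the rest. The sketch below assumes the third-party `datasets` library is installed and that the repository is reachable under the ID shown in the title; `split_name`, `data_file_pattern`, and `peek_shard` are hypothetical helpers, not part of any library. Streaming mode is used so that a multi-gigabyte shard is not downloaded in full just to inspect a few records.

```python
from itertools import islice

REPO_ID = "answerdotai/dolma_20bn_stratified_sample"
NUM_SHARDS = 13  # splits shard_0 .. shard_12


def split_name(index: int) -> str:
    """Return the split name for a shard index, e.g. 0 -> 'shard_0'."""
    if not 0 <= index < NUM_SHARDS:
        raise ValueError(f"shard index must be in [0, {NUM_SHARDS}), got {index}")
    return f"shard_{index}"


def data_file_pattern(index: int) -> str:
    """Return the path pattern the default config uses for this split."""
    return f"data/{split_name(index)}-*"


def peek_shard(index: int, limit: int = 3) -> list:
    """Stream the first `limit` records of one shard (needs network access)."""
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    stream = load_dataset(REPO_ID, split=split_name(index), streaming=True)
    return list(islice(stream, limit))
```

Calling `peek_shard(0)` would return up to three records from shard_0 as dictionaries keyed by the feature names; dropping `streaming=True` would instead download and cache the whole split.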



