orionweller/dolma_18bn_prop_stratified_sample
Source: Hugging Face. Updated 2024-06-18; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/orionweller/dolma_18bn_prop_stratified_sample
Summary:
This dataset is split into 17 shards (shard_0 through shard_16), each holding a large volume of text. Every record has eight fields: id, text, added, created, source, original_shard_dir, original_shard_idx, and num_tokens. The dataset totals 164479763658 bytes, with a download size of 95195640115 bytes. Byte sizes and example counts vary from shard to shard.
Provider:
orionweller
Original information summary
Dataset overview
Features
- id: string
- text: string
- added: string
- created: string
- source: string
- original_shard_dir: string
- original_shard_idx: int64
- num_tokens: int64
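The feature list above describes a flat record with six string fields and two 64-bit integer fields. A minimal sketch of that shape in Python, using only the standard library, might look like the following. The names `EXPECTED_TYPES`, `validate_record`, and `sample`, along with every sample value, are hypothetical and chosen for illustration only; they do not come from the dataset.

```python
# Field name -> expected Python type, following the feature list above.
# The two int64 fields map to Python's arbitrary-precision int.
EXPECTED_TYPES = {
    "id": str,
    "text": str,
    "added": str,
    "created": str,
    "source": str,
    "original_shard_dir": str,
    "original_shard_idx": int,
    "num_tokens": int,
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if it looks valid)."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


# Hypothetical record; the values are placeholders, not real dataset content.
sample = {
    "id": "example-id",
    "text": "Some document text.",
    "added": "2024-01-01",
    "created": "2024-01-01",
    "source": "example-source",
    "original_shard_dir": "example-dir",
    "original_shard_idx": 0,
    "num_tokens": 4,
}

print(validate_record(sample))
```

A check like this can be useful when streaming records, since it flags a missing or mistyped field before downstream code relies on it.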
Data splits
- shard_0:
  - bytes: 10009546720
  - examples: 2961337
- shard_1:
  - bytes: 10000610075
  - examples: 2927550
- shard_2:
  - bytes: 10016337190
  - examples: 2779305
- shard_3:
  - bytes: 10017902470
  - examples: 2752958
- shard_4:
  - bytes: 10004535752
  - examples: 2668522
- shard_5:
  - bytes: 10034434335
  - examples: 2952057
- shard_6:
  - bytes: 10003197839
  - examples: 2874213
- shard_7:
  - bytes: 10015906681
  - examples: 3560266
- shard_8:
  - bytes: 10003761367
  - examples: 3546339
- shard_9:
  - bytes: 10037135426
  - examples: 3396425
- shard_10:
  - bytes: 10013395545
  - examples: 3286193
- shard_11:
  - bytes: 10037101736
  - examples: 3179396
- shard_12:
  - bytes: 10042787924
  - examples: 3151083
- shard_13:
  - bytes: 10011405558
  - examples: 2538235
- shard_14:
  - bytes: 10168167922
  - examples: 6719632
- shard_15:
  - bytes: 10228275391
  - examples: 2485018
- shard_16:
  - bytes: 3835261727
  - examples: 1738939
Dataset size
- Download size: 95195640115 bytes
- Dataset size: 164479763658 bytes
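The per-shard byte counts listed above can be checked against the reported dataset size with a little arithmetic. The sketch below copies the figures from this page into a dictionary and sums them. The variable names are hypothetical, and the average bytes per example is only a rough indicator of document length, since the byte counts cover all fields and not just `text`.

```python
# (bytes, examples) per shard, copied from the split listing on this page.
SHARDS = {
    "shard_0": (10009546720, 2961337),
    "shard_1": (10000610075, 2927550),
    "shard_2": (10016337190, 2779305),
    "shard_3": (10017902470, 2752958),
    "shard_4": (10004535752, 2668522),
    "shard_5": (10034434335, 2952057),
    "shard_6": (10003197839, 2874213),
    "shard_7": (10015906681, 3560266),
    "shard_8": (10003761367, 3546339),
    "shard_9": (10037135426, 3396425),
    "shard_10": (10013395545, 3286193),
    "shard_11": (10037101736, 3179396),
    "shard_12": (10042787924, 3151083),
    "shard_13": (10011405558, 2538235),
    "shard_14": (10168167922, 6719632),
    "shard_15": (10228275391, 2485018),
    "shard_16": (3835261727, 1738939),
}

# Dataset size as reported on this page.
REPORTED_DATASET_BYTES = 164479763658

total_bytes = sum(num_bytes for num_bytes, _ in SHARDS.values())
total_examples = sum(num_examples for _, num_examples in SHARDS.values())

print("shards:", len(SHARDS))
print("total bytes:", total_bytes)
print("matches reported size:", total_bytes == REPORTED_DATASET_BYTES)
print("total examples:", total_examples)

# Rough average record size per shard, in bytes.
for name, (num_bytes, num_examples) in SHARDS.items():
    print(f"{name}: {num_bytes / num_examples:.0f} bytes/example")
```

The 17 shard sizes sum exactly to the reported 164479763658 bytes, which confirms that shard_16 is included in the total.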
Configuration
- config_name: default
- data_files:
  - shard_0: data/shard_0-*
  - shard_1: data/shard_1-*
  - shard_2: data/shard_2-*
  - shard_3: data/shard_3-*
  - shard_4: data/shard_4-*
  - shard_5: data/shard_5-*
  - shard_6: data/shard_6-*
  - shard_7: data/shard_7-*
  - shard_8: data/shard_8-*
  - shard_9: data/shard_9-*
  - shard_10: data/shard_10-*
  - shard_11: data/shard_11-*
  - shard_12: data/shard_12-*
  - shard_13: data/shard_13-*
  - shard_14: data/shard_14-*
  - shard_15: data/shard_15-*
  - shard_16: data/shard_16-*
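Because the configuration maps each split name to a file pattern, a single shard can be requested by name instead of downloading the full dataset. The sketch below assumes the Hugging Face `datasets` library is installed and that the repository is still reachable under the ID shown on this page. The helper names `split_names`, `data_file_pattern`, and `stream_shard` are hypothetical. The import is deferred so the helpers work without the library or network access.

```python
REPO_ID = "orionweller/dolma_18bn_prop_stratified_sample"
NUM_SHARDS = 17  # shard_0 through shard_16, per the listing on this page


def split_names(num_shards: int = NUM_SHARDS) -> list[str]:
    """Return the split names used by the default config."""
    return [f"shard_{i}" for i in range(num_shards)]


def data_file_pattern(split: str) -> str:
    """Return the file pattern the config associates with a split."""
    return f"data/{split}-*"


def stream_shard(split: str):
    """Open one shard as a streaming dataset (requires network access)."""
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import load_dataset

    if split not in split_names():
        raise ValueError(f"unknown split: {split}")
    # streaming=True iterates over records without downloading the whole shard.
    return load_dataset(REPO_ID, split=split, streaming=True)


print(split_names()[:3], data_file_pattern("shard_0"))
```

Under those assumptions, `next(iter(stream_shard("shard_16")))` would fetch the first record of the smallest shard. Streaming is a reasonable default here given that most shards are roughly 10 GB each.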



