
orionweller/dolma_20bn_wiki_upsample

Source: Hugging Face. Updated 2024-06-12; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/orionweller/dolma_20bn_wiki_upsample
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: text
    dtype: string
  - name: added
    dtype: string
  - name: created
    dtype: string
  - name: source
    dtype: string
  - name: original_shard_dir
    dtype: string
  - name: original_shard_idx
    dtype: int64
  - name: num_tokens
    dtype: int64
  splits:
  - name: shard_0
    num_bytes: 10048343063
    num_examples: 3082936
  - name: shard_1
    num_bytes: 10025703829
    num_examples: 2736677
  - name: shard_2
    num_bytes: 10015117262
    num_examples: 2722726
  - name: shard_3
    num_bytes: 10002162828
    num_examples: 2850395
  - name: shard_4
    num_bytes: 10048812357
    num_examples: 2893974
  - name: shard_5
    num_bytes: 10016959439
    num_examples: 3759486
  - name: shard_6
    num_bytes: 10043574169
    num_examples: 3389532
  - name: shard_7
    num_bytes: 10011168227
    num_examples: 3183976
  - name: shard_8
    num_bytes: 10019125382
    num_examples: 3147012
  - name: shard_9
    num_bytes: 10043973897
    num_examples: 4916390
  - name: shard_10
    num_bytes: 10136633345
    num_examples: 2857695
  - name: shard_11
    num_bytes: 11034916419
    num_examples: 3568971
  - name: shard_12
    num_bytes: 5259699689
    num_examples: 2676658
  download_size: 73281475328
  dataset_size: 126706189906
configs:
- config_name: default
  data_files:
  - split: shard_0
    path: data/shard_0-*
  - split: shard_1
    path: data/shard_1-*
  - split: shard_2
    path: data/shard_2-*
  - split: shard_3
    path: data/shard_3-*
  - split: shard_4
    path: data/shard_4-*
  - split: shard_5
    path: data/shard_5-*
  - split: shard_6
    path: data/shard_6-*
  - split: shard_7
    path: data/shard_7-*
  - split: shard_8
    path: data/shard_8-*
  - split: shard_9
    path: data/shard_9-*
  - split: shard_10
    path: data/shard_10-*
  - split: shard_11
    path: data/shard_11-*
  - split: shard_12
    path: data/shard_12-*
---
Provider:
orionweller
Summary of original information

Dataset overview

Dataset features

  • id: string
  • text: string
  • added: string
  • created: string
  • source: string
  • original_shard_dir: string
  • original_shard_idx: 64-bit integer (int64)
  • num_tokens: 64-bit integer (int64)
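
The feature list above can be turned into a quick conformance check for individual records. This is only a sketch: it assumes every field is required and non-null (the listing does not say whether nulls occur), and the helper name `schema_problems` and the sample record are made up for illustration.

```python
from typing import Any

# Field names and Python types corresponding to the feature list above
# (string -> str, int64 -> int).
EXPECTED_FIELDS: dict[str, type] = {
    "id": str,
    "text": str,
    "added": str,
    "created": str,
    "source": str,
    "original_shard_dir": str,
    "original_shard_idx": int,
    "num_tokens": int,
}


def schema_problems(record: dict[str, Any]) -> list[str]:
    """Return readable schema problems; an empty list means the record conforms."""
    problems = []
    for name, expected in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    # Report fields that are not part of the documented schema.
    for name in sorted(record.keys() - EXPECTED_FIELDS.keys()):
        problems.append(f"unexpected field: {name}")
    return problems


# Made-up record for illustration only; these values are not from the dataset.
example = {
    "id": "doc-0",
    "text": "Some article text.",
    "added": "2024-01-01",
    "created": "2024-01-01",
    "source": "wiki",
    "original_shard_dir": "some_dir",
    "original_shard_idx": 0,
    "num_tokens": 4,
}
print(schema_problems(example))
```

A check like this is mostly useful when writing derived files, where a stringified `num_tokens` or a dropped column would otherwise go unnoticed.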

Dataset shard information

  • shard_0:
    • Bytes: 10048343063
    • Examples: 3082936
  • shard_1:
    • Bytes: 10025703829
    • Examples: 2736677
  • shard_2:
    • Bytes: 10015117262
    • Examples: 2722726
  • shard_3:
    • Bytes: 10002162828
    • Examples: 2850395
  • shard_4:
    • Bytes: 10048812357
    • Examples: 2893974
  • shard_5:
    • Bytes: 10016959439
    • Examples: 3759486
  • shard_6:
    • Bytes: 10043574169
    • Examples: 3389532
  • shard_7:
    • Bytes: 10011168227
    • Examples: 3183976
  • shard_8:
    • Bytes: 10019125382
    • Examples: 3147012
  • shard_9:
    • Bytes: 10043973897
    • Examples: 4916390
  • shard_10:
    • Bytes: 10136633345
    • Examples: 2857695
  • shard_11:
    • Bytes: 11034916419
    • Examples: 3568971
  • shard_12:
    • Bytes: 5259699689
    • Examples: 2676658
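
As a sanity check on the shard figures above, the short script below sums the per-shard counts and compares the byte total with the dataset size reported in the metadata (126706189906 bytes). The variable names are illustrative; the numbers are copied from the listing.

```python
# (split name, num_bytes, num_examples), copied from the shard list above.
SHARDS = [
    ("shard_0", 10048343063, 3082936),
    ("shard_1", 10025703829, 2736677),
    ("shard_2", 10015117262, 2722726),
    ("shard_3", 10002162828, 2850395),
    ("shard_4", 10048812357, 2893974),
    ("shard_5", 10016959439, 3759486),
    ("shard_6", 10043574169, 3389532),
    ("shard_7", 10011168227, 3183976),
    ("shard_8", 10019125382, 3147012),
    ("shard_9", 10043973897, 4916390),
    ("shard_10", 10136633345, 2857695),
    ("shard_11", 11034916419, 3568971),
    ("shard_12", 5259699689, 2676658),
]

# dataset_size as reported in the dataset metadata.
REPORTED_DATASET_SIZE = 126706189906

total_bytes = sum(num_bytes for _, num_bytes, _ in SHARDS)
total_examples = sum(num_examples for _, _, num_examples in SHARDS)

print("bytes match reported size:", total_bytes == REPORTED_DATASET_SIZE)
print("total examples:", total_examples)

# Average record size per shard, to show how document length varies.
for name, num_bytes, num_examples in SHARDS:
    print(f"{name}: {num_bytes / num_examples:.0f} bytes per example")
```

The per-shard byte counts add up exactly to the reported dataset size, and the 13 shards hold 41,786,428 examples in total. Most shards are close to 10 GB; shard_12 is the remainder at roughly half that.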

Dataset size

  • Download size: 73281475328 bytes
  • Dataset size: 126706189906 bytes
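
Raw byte counts are hard to read, so the sketch below converts the two figures to decimal and binary units and computes their ratio. It assumes the usual Hugging Face convention that the download size refers to the stored files and the dataset size to the data after loading; the listing itself does not define the two terms.

```python
DOWNLOAD_SIZE = 73281475328   # bytes, from the listing
DATASET_SIZE = 126706189906   # bytes, from the listing


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return num_bytes / 10**9


def to_gib(num_bytes: int) -> float:
    """Convert bytes to binary gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


ratio = DOWNLOAD_SIZE / DATASET_SIZE

print(f"download: {to_gb(DOWNLOAD_SIZE):.2f} GB ({to_gib(DOWNLOAD_SIZE):.2f} GiB)")
print(f"dataset:  {to_gb(DATASET_SIZE):.2f} GB ({to_gib(DATASET_SIZE):.2f} GiB)")
print(f"download size is {ratio:.1%} of dataset size")
```

When planning disk space, it is safer to budget for both figures, since a local cache may hold the downloaded files and the prepared data at the same time.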

Configuration

  • config_name: default
    • data_files:
      • shard_0: data/shard_0-*
      • shard_1: data/shard_1-*
      • shard_2: data/shard_2-*
      • shard_3: data/shard_3-*
      • shard_4: data/shard_4-*
      • shard_5: data/shard_5-*
      • shard_6: data/shard_6-*
      • shard_7: data/shard_7-*
      • shard_8: data/shard_8-*
      • shard_9: data/shard_9-*
      • shard_10: data/shard_10-*
      • shard_11: data/shard_11-*
      • shard_12: data/shard_12-*
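
Because each shard is exposed as its own split of the default config, a single shard can be inspected without fetching the others. The following is a minimal sketch using the Hugging Face `datasets` library in streaming mode; the listing does not prescribe a loading method, so the choice of library and the helper name `peek` are assumptions.

```python
REPO_ID = "orionweller/dolma_20bn_wiki_upsample"
SPLITS = [f"shard_{i}" for i in range(13)]  # shard_0 .. shard_12


def peek(split: str, limit: int = 3) -> list[dict]:
    """Stream the first `limit` records of one split instead of downloading it whole."""
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {SPLITS}")
    # Imported lazily so the split check above works without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that reads records on demand.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(stream.take(limit))
```

Calling `peek("shard_12")` requires network access and an installed `datasets` package (`pip install datasets`). Streaming is a reasonable default here because every full shard except the last is about 10 GB.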