
jackeyhug/jackeydataset

Source: Hugging Face. Updated: 2024-03-17. Indexed: 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jackeyhug/jackeydataset
Resource description:
dataset_info:
- config_name: auto_math_text
  features:
  - {name: prompt, dtype: string}
  - {name: text_token_length, dtype: int64}
  - {name: text, dtype: string}
  - {name: seed_data, dtype: string}
  - {name: format, dtype: string}
  - {name: audience, dtype: string}
  splits:
  - {name: train, num_bytes: 8777587297.907892, num_examples: 1949895}
  download_size: 4461401898
  dataset_size: 8777587297.907892
- config_name: khanacademy
  features:
  - {name: prompt, dtype: string}
  - {name: text_token_length, dtype: int64}
  - {name: text, dtype: string}
  - {name: seed_data, dtype: string}
  - {name: format, dtype: string}
  - {name: audience, dtype: string}
  splits:
  - {name: train, num_bytes: 108591354.09210858, num_examples: 24123}
  download_size: 49139761
  dataset_size: 108591354.09210858
- config_name: openstax
  features:
  - {name: text_token_length, dtype: int64}
  - {name: prompt, dtype: string}
  - {name: text, dtype: string}
  - {name: seed_data, dtype: string}
  - {name: format, dtype: string}
  - {name: audience, dtype: string}
  splits:
  - {name: train, num_bytes: 667837450, num_examples: 126332}
  download_size: 346992522
  dataset_size: 667837450
- config_name: stanford
  features:
  - {name: text_token_length, dtype: int64}
  - {name: prompt, dtype: string}
  - {name: text, dtype: string}
  - {name: seed_data, dtype: string}
  - {name: format, dtype: string}
  - {name: audience, dtype: string}
  splits:
  - {name: train, num_bytes: 6341291506, num_examples: 1020024}
  download_size: 3302284560
  dataset_size: 6341291506
- config_name: stories
  features:
  - {name: text, dtype: string}
  - {name: prompt, dtype: string}
  - {name: text_token_length, dtype: int64}
  - {name: seed_data, dtype: string}
  - {name: format, dtype: string}
  - {name: audience, dtype: string}
  splits:
  - {name: train, num_bytes: 21314739648, num_examples: 4992964}
  download_size: 11902294709
  dataset_size: 21314739648
- config_name: web_samples_v1
  features:
  - {name: text_token_length, dtype: int64}
  - {name: prompt, dtype: string}
  - {name: text, dtype: string}
  - {name: seed_data, dtype: string}
  - {name: format, dtype: string}
  - {name: audience, dtype: string}
  splits:
  - {name: train, num_bytes: 69075726295, num_examples: 12426348}
  download_size: 38978124936
  dataset_size: 69075726295
- config_name: web_samples_v2
  features:
  - {name: text_token_length, dtype: int64}
  - {name: prompt, dtype: string}
  - {name: text, dtype: string}
  - {name: seed_data, dtype: string}
  - {name: format, dtype: string}
  - {name: audience, dtype: string}
  splits:
  - {name: train, num_bytes: 58711802939, num_examples: 10345867}
  download_size: 32658254617
  dataset_size: 58711802939
- config_name: wikihow
  features:
  - {name: text_token_length, dtype: int64}
  - {name: prompt, dtype: string}
  - {name: text, dtype: string}
  - {name: seed_data, dtype: string}
  - {name: format, dtype: string}
  - {name: audience, dtype: string}
  splits:
  - {name: train, num_bytes: 892720528, num_examples: 179191}
  download_size: 502284600
  dataset_size: 892720528
configs:
- config_name: auto_math_text
  data_files:
  - {split: train, path: data/auto_math_text/train-*}
- config_name: khanacademy
  data_files:
  - {split: train, path: data/khanacademy/train-*}
- config_name: openstax
  data_files:
  - {split: train, path: data/openstax/train-*}
- config_name: stanford
  data_files:
  - {split: train, path: data/stanford/train-*}
- config_name: stories
  data_files:
  - {split: train, path: data/stories/train-*}
- config_name: web_samples_v1
  data_files:
  - {split: train, path: data/web_samples_v1/train-*}
- config_name: web_samples_v2
  data_files:
  - {split: train, path: data/web_samples_v2/train-*}
- config_name: wikihow
  data_files:
  - {split: train, path: data/wikihow/train-*}
license: apache-2.0
language:
- en
tags:
- synthetic
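Every configuration in the metadata above declares the same six features, though not always in the same order. A minimal sketch of a per-row check in Python, assuming rows arrive as plain dictionaries; `EXPECTED_FEATURES` and `validate_row` are hypothetical names chosen for illustration and are not part of the dataset or of any library:

```python
# Feature names and dtypes as listed in the dataset card.
# The card's int64 is mapped to Python's int, string to str.
EXPECTED_FEATURES = {
    "prompt": str,
    "text_token_length": int,
    "text": str,
    "seed_data": str,
    "format": str,
    "audience": str,
}


def validate_row(row: dict) -> list:
    """Return a list of problems found in one row (an empty list means none)."""
    problems = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in row:
            problems.append(f"missing field: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
    # Report any keys the card does not declare.
    for name in sorted(set(row) - set(EXPECTED_FEATURES)):
        problems.append(f"unexpected field: {name}")
    return problems
```

Because the check is keyed by feature name, the differing column order between configurations does not matter.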
Provider:
jackeyhug
Summary of original information

Dataset overview

Dataset configurations

1. auto_math_text

  • Features:
    • prompt: string
    • text_token_length: int64
    • text: string
    • seed_data: string
    • format: string
    • audience: string
  • Splits:
    • train:
      • num_bytes: 8777587297.907892
      • num_examples: 1949895
  • Download size: 4461401898 bytes
  • Dataset size: 8777587297.907892 bytes

2. khanacademy

  • Features:
    • prompt: string
    • text_token_length: int64
    • text: string
    • seed_data: string
    • format: string
    • audience: string
  • Splits:
    • train:
      • num_bytes: 108591354.09210858
      • num_examples: 24123
  • Download size: 49139761 bytes
  • Dataset size: 108591354.09210858 bytes

3. openstax

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
  • Splits:
    • train:
      • num_bytes: 667837450
      • num_examples: 126332
  • Download size: 346992522 bytes
  • Dataset size: 667837450 bytes

4. stanford

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
  • Splits:
    • train:
      • num_bytes: 6341291506
      • num_examples: 1020024
  • Download size: 3302284560 bytes
  • Dataset size: 6341291506 bytes

5. stories

  • Features:
    • text: string
    • prompt: string
    • text_token_length: int64
    • seed_data: string
    • format: string
    • audience: string
  • Splits:
    • train:
      • num_bytes: 21314739648
      • num_examples: 4992964
  • Download size: 11902294709 bytes
  • Dataset size: 21314739648 bytes

6. web_samples_v1

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
  • Splits:
    • train:
      • num_bytes: 69075726295
      • num_examples: 12426348
  • Download size: 38978124936 bytes
  • Dataset size: 69075726295 bytes

7. web_samples_v2

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
  • Splits:
    • train:
      • num_bytes: 58711802939
      • num_examples: 10345867
  • Download size: 32658254617 bytes
  • Dataset size: 58711802939 bytes

8. wikihow

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
  • Splits:
    • train:
      • num_bytes: 892720528
      • num_examples: 179191
  • Download size: 502284600 bytes
  • Dataset size: 892720528 bytes
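
The per-configuration figures above can be combined to estimate how much disk space and bandwidth a full or partial download would need. A minimal sketch in Python, with the numbers copied from the listing; the names `CONFIGS` and `summarize` are hypothetical, and "GB" here means 10^9 bytes:

```python
# (num_examples, download_size in bytes, dataset_size in bytes) per configuration,
# copied from the listing; fractional byte counts are kept exactly as given.
CONFIGS = {
    "auto_math_text": (1949895, 4461401898, 8777587297.907892),
    "khanacademy": (24123, 49139761, 108591354.09210858),
    "openstax": (126332, 346992522, 667837450),
    "stanford": (1020024, 3302284560, 6341291506),
    "stories": (4992964, 11902294709, 21314739648),
    "web_samples_v1": (12426348, 38978124936, 69075726295),
    "web_samples_v2": (10345867, 32658254617, 58711802939),
    "wikihow": (179191, 502284600, 892720528),
}

# Totals across all eight configurations.
total_examples = sum(n for n, _, _ in CONFIGS.values())
total_download = sum(d for _, d, _ in CONFIGS.values())


def summarize():
    """Return (name, examples, download in GB, average bytes per example) rows."""
    return [
        (name, n, download / 1e9, size / n)
        for name, (n, download, size) in CONFIGS.items()
    ]


for name, n, download_gb, bytes_per_example in summarize():
    print(f"{name:16s} {n:>10d} examples  {download_gb:8.2f} GB  "
          f"{bytes_per_example:8.0f} bytes/example")
print(f"total: {total_examples} examples, {total_download / 1e9:.2f} GB to download")
```

With the figures as listed, the totals come to 31,064,744 examples and 92,200,777,603 bytes of downloads, most of it in the two web_samples configurations.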

Data file paths

  • auto_math_text:
    • train: data/auto_math_text/train-*
  • khanacademy:
    • train: data/khanacademy/train-*
  • openstax:
    • train: data/openstax/train-*
  • stanford:
    • train: data/stanford/train-*
  • stories:
    • train: data/stories/train-*
  • web_samples_v1:
    • train: data/web_samples_v1/train-*
  • web_samples_v2:
    • train: data/web_samples_v2/train-*
  • wikihow:
    • train: data/wikihow/train-*
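
The path patterns above are shell-style globs, one per configuration. A minimal sketch of how they might be used, assuming the Hugging Face `datasets` library is installed and the repository is reachable; the helper names `pattern_for`, `matches_config`, and `stream_config` are hypothetical, as is the shard filename in the usage note:

```python
from fnmatch import fnmatch

REPO_ID = "jackeyhug/jackeydataset"  # repository name from the page title
CONFIG_NAMES = [
    "auto_math_text", "khanacademy", "openstax", "stanford",
    "stories", "web_samples_v1", "web_samples_v2", "wikihow",
]


def pattern_for(config_name: str, split: str = "train") -> str:
    """Build the glob pattern listed for a configuration, e.g. data/stories/train-*."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown configuration: {config_name}")
    return f"data/{config_name}/{split}-*"


def matches_config(path: str, config_name: str) -> bool:
    """Check whether a repository file path belongs to a configuration's train split."""
    return fnmatch(path, pattern_for(config_name))


def stream_config(config_name: str):
    """Open one configuration as a stream instead of downloading it in full."""
    # Imported lazily so the helpers above work without the third-party library.
    from datasets import load_dataset

    pattern_for(config_name)  # raises ValueError for an unknown name
    return load_dataset(REPO_ID, config_name, split="train", streaming=True)
```

For example, `matches_config("data/stories/train-00000-of-00043.parquet", "stories")` is true for that made-up shard name, and `stream_config("khanacademy")` would start with the smallest configuration. Streaming is a design choice here because the larger configurations run to tens of gigabytes. If the mirror in the download link is preferred, `huggingface_hub` reads the `HF_ENDPOINT` environment variable; whether the mirror serves this repository has not been checked here.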

License

  • apache-2.0

Language

  • en

Tags

  • synthetic