
sanchit-gandhi/cosmopedia-logprobs

Hugging Face · Updated 2024-05-08 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/sanchit-gandhi/cosmopedia-logprobs
Resource description:
---
dataset_info:
- config_name: auto_math_text
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2254352700
    num_examples: 500000
  download_size: 1142306970
  dataset_size: 2254352700
- config_name: khanacademy
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 122255635
    num_examples: 24123
  download_size: 48957897
  dataset_size: 122255635
- config_name: openstax
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 669353434
    num_examples: 126332
  download_size: 348325842
  dataset_size: 669353434
- config_name: stanford
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 6228922595
    num_examples: 1000000
  download_size: 3244302299
  dataset_size: 6228922595
- config_name: stories
  features:
  - name: text
    dtype: string
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2140433499
    num_examples: 500000
  download_size: 1188905527
  dataset_size: 2140433499
- config_name: web_samples_v1
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2784556991
    num_examples: 500000
  download_size: 1571554899
  dataset_size: 2784556991
- config_name: web_samples_v2
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2842778735
    num_examples: 500000
  download_size: 1582617724
  dataset_size: 2842778735
- config_name: wikihow
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 894870820
    num_examples: 179191
  download_size: 503175527
  dataset_size: 894870820
configs:
- config_name: auto_math_text
  data_files:
  - split: train
    path: auto_math_text/train-*
- config_name: khanacademy
  data_files:
  - split: train
    path: khanacademy/train-*
- config_name: openstax
  data_files:
  - split: train
    path: openstax/train-*
- config_name: stanford
  data_files:
  - split: train
    path: stanford/train-*
- config_name: stories
  data_files:
  - split: train
    path: stories/train-*
- config_name: web_samples_v1
  data_files:
  - split: train
    path: web_samples_v1/train-*
- config_name: web_samples_v2
  data_files:
  - split: train
    path: web_samples_v2/train-*
- config_name: wikihow
  data_files:
  - split: train
    path: wikihow/train-*
---

The dataset consists of eight subsets, each identified by a configuration name and carrying the same eight features: prompt, text token length, text, seed data, format, audience, prompt length, and log probabilities. The subsets draw on math text, Khan Academy, OpenStax textbooks, Stanford, stories, two sets of web samples, and WikiHow. Each subset has a single train split, with its byte size and example count given in the metadata.
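
Each subset can be requested by its configuration name. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repository is still reachable on the Hub under the ID shown in the download link; `load_subset` is a hypothetical helper name introduced here for illustration, not part of any library.

```python
# Configuration names as they appear in the dataset metadata.
CONFIG_NAMES = (
    "auto_math_text",
    "khanacademy",
    "openstax",
    "stanford",
    "stories",
    "web_samples_v1",
    "web_samples_v2",
    "wikihow",
)

REPO_ID = "sanchit-gandhi/cosmopedia-logprobs"


def load_subset(config_name, streaming=True):
    """Hypothetical helper: return the train split of one configuration."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(
            f"unknown config {config_name!r}; expected one of {CONFIG_NAMES}"
        )
    # Imported lazily so the name check above runs even without the library.
    from datasets import load_dataset

    # streaming=True iterates over the data without downloading it all first.
    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)
```

Under these assumptions, `load_subset("khanacademy")` would return an iterable over the smallest subset; streaming is the default in this sketch because the larger subsets run to several gigabytes.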
Provided by:
sanchit-gandhi
Summary of original information

Dataset overview

1. auto_math_text

  • Features:
    • prompt: string
    • text_token_length: int64
    • text: string
    • seed_data: string
    • format: string
    • audience: string
    • prompt_length: int64
    • logprobs: float32
  • Splits:
    • train: 500000 examples, 2254352700 bytes
  • Download size: 1142306970 bytes
  • Dataset size: 2254352700 bytes

2. khanacademy

  • Features:
    • prompt: string
    • text_token_length: int64
    • text: string
    • seed_data: string
    • format: string
    • audience: string
    • prompt_length: int64
    • logprobs: float32
  • Splits:
    • train: 24123 examples, 122255635 bytes
  • Download size: 48957897 bytes
  • Dataset size: 122255635 bytes

3. openstax

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
    • prompt_length: int64
    • logprobs: float32
  • Splits:
    • train: 126332 examples, 669353434 bytes
  • Download size: 348325842 bytes
  • Dataset size: 669353434 bytes

4. stanford

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
    • prompt_length: int64
    • logprobs: float32
  • Splits:
    • train: 1000000 examples, 6228922595 bytes
  • Download size: 3244302299 bytes
  • Dataset size: 6228922595 bytes

5. stories

  • Features:
    • text: string
    • prompt: string
    • text_token_length: int64
    • seed_data: string
    • format: string
    • audience: string
    • prompt_length: int64
    • logprobs: float32
  • Splits:
    • train: 500000 examples, 2140433499 bytes
  • Download size: 1188905527 bytes
  • Dataset size: 2140433499 bytes

6. web_samples_v1

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
    • prompt_length: int64
    • logprobs: float32
  • Splits:
    • train: 500000 examples, 2784556991 bytes
  • Download size: 1571554899 bytes
  • Dataset size: 2784556991 bytes

7. web_samples_v2

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
    • prompt_length: int64
    • logprobs: float32
  • Splits:
    • train: 500000 examples, 2842778735 bytes
  • Download size: 1582617724 bytes
  • Dataset size: 2842778735 bytes

8. wikihow

  • Features:
    • text_token_length: int64
    • prompt: string
    • text: string
    • seed_data: string
    • format: string
    • audience: string
    • prompt_length: int64
    • logprobs: float32
  • Splits:
    • train: 179191 examples, 894870820 bytes
  • Download size: 503175527 bytes
  • Dataset size: 894870820 bytes
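
The per-subset figures above can be combined to estimate overall storage and download requirements. The sketch below is illustrative only: the numbers are copied from the listing, while the `SUBSETS` table and the `summarize` helper are hypothetical names introduced here.

```python
# (examples, dataset size in bytes, download size in bytes), copied from the listing.
SUBSETS = {
    "auto_math_text": (500000, 2254352700, 1142306970),
    "khanacademy": (24123, 122255635, 48957897),
    "openstax": (126332, 669353434, 348325842),
    "stanford": (1000000, 6228922595, 3244302299),
    "stories": (500000, 2140433499, 1188905527),
    "web_samples_v1": (500000, 2784556991, 1571554899),
    "web_samples_v2": (500000, 2842778735, 1582617724),
    "wikihow": (179191, 894870820, 503175527),
}


def summarize(subsets):
    """Return total examples, total dataset bytes, and total download bytes."""
    total_examples = sum(n for n, _, _ in subsets.values())
    total_bytes = sum(size for _, size, _ in subsets.values())
    total_download = sum(dl for _, _, dl in subsets.values())
    return total_examples, total_bytes, total_download


total_examples, total_bytes, total_download = summarize(SUBSETS)
print(f"examples: {total_examples:,}")
print(f"dataset size: {total_bytes / 1e9:.2f} GB")
print(f"download size: {total_download / 1e9:.2f} GB")

# Average uncompressed bytes per example, per subset.
for name, (n, size, _) in SUBSETS.items():
    print(f"{name}: {size / n:,.0f} bytes per example")
```

Across all eight subsets this comes to 3,329,646 examples, with the download size a little over half of the dataset size.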