sanchit-gandhi/cosmopedia-logprobs
Source: Hugging Face. Updated 2024-05-08; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/sanchit-gandhi/cosmopedia-logprobs
Resource description:
---
dataset_info:
- config_name: auto_math_text
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2254352700
    num_examples: 500000
  download_size: 1142306970
  dataset_size: 2254352700
- config_name: khanacademy
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 122255635
    num_examples: 24123
  download_size: 48957897
  dataset_size: 122255635
- config_name: openstax
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 669353434
    num_examples: 126332
  download_size: 348325842
  dataset_size: 669353434
- config_name: stanford
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 6228922595
    num_examples: 1000000
  download_size: 3244302299
  dataset_size: 6228922595
- config_name: stories
  features:
  - name: text
    dtype: string
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2140433499
    num_examples: 500000
  download_size: 1188905527
  dataset_size: 2140433499
- config_name: web_samples_v1
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2784556991
    num_examples: 500000
  download_size: 1571554899
  dataset_size: 2784556991
- config_name: web_samples_v2
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2842778735
    num_examples: 500000
  download_size: 1582617724
  dataset_size: 2842778735
- config_name: wikihow
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 894870820
    num_examples: 179191
  download_size: 503175527
  dataset_size: 894870820
configs:
- config_name: auto_math_text
  data_files:
  - split: train
    path: auto_math_text/train-*
- config_name: khanacademy
  data_files:
  - split: train
    path: khanacademy/train-*
- config_name: openstax
  data_files:
  - split: train
    path: openstax/train-*
- config_name: stanford
  data_files:
  - split: train
    path: stanford/train-*
- config_name: stories
  data_files:
  - split: train
    path: stories/train-*
- config_name: web_samples_v1
  data_files:
  - split: train
    path: web_samples_v1/train-*
- config_name: web_samples_v2
  data_files:
  - split: train
    path: web_samples_v2/train-*
- config_name: wikihow
  data_files:
  - split: train
    path: wikihow/train-*
---
The dataset comprises eight subsets (configs): auto_math_text, khanacademy, openstax, stanford, stories, web_samples_v1, web_samples_v2, and wikihow. The names reflect the sources: math text, Khan Academy, OpenStax textbooks, Stanford, stories, two sets of web samples, and WikiHow. Every subset has a single train split and the same eight features: prompt, text, seed_data, format, and audience (strings), text_token_length and prompt_length (int64), and logprobs (float32). Together the subsets hold 3,329,646 examples; the metadata lists the example count, dataset size, and download size of each subset in bytes.
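As a rough illustration of how a subset with these features might be consumed, the sketch below loads one config with the Hugging Face `datasets` library, checks that a row carries the listed features, and keeps rows whose `logprobs` value clears a threshold. The helper names (`load_subset`, `has_expected_schema`, `keep_confident`), the threshold, and the sample rows are assumptions made for illustration. The card does not say which model produced `logprobs` or how the value is normalised, so the direction of the filter is a guess rather than a recommendation.

```python
from typing import Any, Dict, Iterable, Iterator

REPO_ID = "sanchit-gandhi/cosmopedia-logprobs"

# Feature names from the card, mapped to the Python types they decode to.
FEATURES = {
    "prompt": str,
    "text_token_length": int,
    "text": str,
    "seed_data": str,
    "format": str,
    "audience": str,
    "prompt_length": int,
    "logprobs": float,
}


def load_subset(config_name: str, streaming: bool = True):
    """Load the train split of one config (needs the `datasets` package and network)."""
    from datasets import load_dataset  # lazy import so the helpers below run offline

    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)


def has_expected_schema(row: Dict[str, Any]) -> bool:
    """Return True if the row has every card feature with a compatible type."""
    return all(
        name in row and isinstance(row[name], py_type)
        for name, py_type in FEATURES.items()
    )


def keep_confident(
    rows: Iterable[Dict[str, Any]], min_logprob: float
) -> Iterator[Dict[str, Any]]:
    """Yield rows whose logprobs is at least min_logprob (hypothetical criterion)."""
    for row in rows:
        if row["logprobs"] >= min_logprob:
            yield row


# Made-up placeholder rows, NOT taken from the dataset; they only mimic the schema.
SAMPLE_ROWS = [
    {"prompt": "placeholder prompt A", "text_token_length": 3, "text": "a b c",
     "seed_data": "khanacademy", "format": "placeholder", "audience": "placeholder",
     "prompt_length": 3, "logprobs": -0.5},
    {"prompt": "placeholder prompt B", "text_token_length": 2, "text": "d e",
     "seed_data": "khanacademy", "format": "placeholder", "audience": "placeholder",
     "prompt_length": 3, "logprobs": -2.0},
]

kept = list(keep_confident(SAMPLE_ROWS, min_logprob=-1.0))
```

With network access, `keep_confident(load_subset("khanacademy"), -1.0)` would apply the same filter lazily to the smallest subset; streaming avoids downloading the full files up front.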
Provider:
sanchit-gandhi
Summary of original information
Dataset overview
1. auto_math_text
- Features:
  - prompt: string
  - text_token_length: int64
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - prompt_length: int64
  - logprobs: float32
- Splits:
  - train: 500000 examples, 2254352700 bytes
- Download size: 1142306970 bytes
- Dataset size: 2254352700 bytes
2. khanacademy
- Features:
  - prompt: string
  - text_token_length: int64
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - prompt_length: int64
  - logprobs: float32
- Splits:
  - train: 24123 examples, 122255635 bytes
- Download size: 48957897 bytes
- Dataset size: 122255635 bytes
3. openstax
- Features:
  - text_token_length: int64
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - prompt_length: int64
  - logprobs: float32
- Splits:
  - train: 126332 examples, 669353434 bytes
- Download size: 348325842 bytes
- Dataset size: 669353434 bytes
4. stanford
- Features:
  - text_token_length: int64
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - prompt_length: int64
  - logprobs: float32
- Splits:
  - train: 1000000 examples, 6228922595 bytes
- Download size: 3244302299 bytes
- Dataset size: 6228922595 bytes
5. stories
- Features:
  - text: string
  - prompt: string
  - text_token_length: int64
  - seed_data: string
  - format: string
  - audience: string
  - prompt_length: int64
  - logprobs: float32
- Splits:
  - train: 500000 examples, 2140433499 bytes
- Download size: 1188905527 bytes
- Dataset size: 2140433499 bytes
6. web_samples_v1
- Features:
  - text_token_length: int64
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - prompt_length: int64
  - logprobs: float32
- Splits:
  - train: 500000 examples, 2784556991 bytes
- Download size: 1571554899 bytes
- Dataset size: 2784556991 bytes
7. web_samples_v2
- Features:
  - text_token_length: int64
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - prompt_length: int64
  - logprobs: float32
- Splits:
  - train: 500000 examples, 2842778735 bytes
- Download size: 1582617724 bytes
- Dataset size: 2842778735 bytes
8. wikihow
- Features:
  - text_token_length: int64
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - prompt_length: int64
  - logprobs: float32
- Splits:
  - train: 179191 examples, 894870820 bytes
- Download size: 503175527 bytes
- Dataset size: 894870820 bytes
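
The per-subset figures above can be combined to estimate the overall footprint before downloading anything. The following is a minimal sketch that copies the example counts and byte sizes from the card into a table and derives totals and a rough bytes-per-example figure. The names `SUBSETS`, `totals`, and `bytes_per_example` are made up for this illustration.

```python
from typing import Dict, Tuple

# (num_examples, dataset_size_bytes, download_size_bytes), copied from the card.
SUBSETS: Dict[str, Tuple[int, int, int]] = {
    "auto_math_text": (500000, 2254352700, 1142306970),
    "khanacademy": (24123, 122255635, 48957897),
    "openstax": (126332, 669353434, 348325842),
    "stanford": (1000000, 6228922595, 3244302299),
    "stories": (500000, 2140433499, 1188905527),
    "web_samples_v1": (500000, 2784556991, 1571554899),
    "web_samples_v2": (500000, 2842778735, 1582617724),
    "wikihow": (179191, 894870820, 503175527),
}


def totals() -> Tuple[int, int, int]:
    """Sum examples, dataset bytes, and download bytes over all subsets."""
    examples = sum(n for n, _, _ in SUBSETS.values())
    dataset_bytes = sum(d for _, d, _ in SUBSETS.values())
    download_bytes = sum(dl for _, _, dl in SUBSETS.values())
    return examples, dataset_bytes, download_bytes


def bytes_per_example(name: str) -> float:
    """Average in-memory bytes per example for one subset."""
    num_examples, dataset_bytes, _ = SUBSETS[name]
    return dataset_bytes / num_examples


if __name__ == "__main__":
    n, size, download = totals()
    print(f"{n} examples, {size / 1e9:.1f} GB on disk, {download / 1e9:.1f} GB to download")
    for name in SUBSETS:
        print(f"{name}: {bytes_per_example(name):.0f} bytes per example")
```

Summing the table gives 3,329,646 examples, 17,937,524,409 bytes of dataset size, and 9,630,146,685 bytes of download size, so the download is a little over half the size of the loaded data.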



