sanchit-gandhi/cosmopedia-logprobs

Name: sanchit-gandhi/cosmopedia-logprobs
Creator: sanchit-gandhi
Published: 2024-05-08 08:38:38
License: not specified

Source: Hugging Face. Updated 2024-05-08. Indexed 2024-06-12.

Download link:

https://hf-mirror.com/datasets/sanchit-gandhi/cosmopedia-logprobs


Resource summary:

---
dataset_info:
- config_name: auto_math_text
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2254352700
    num_examples: 500000
  download_size: 1142306970
  dataset_size: 2254352700
- config_name: khanacademy
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 122255635
    num_examples: 24123
  download_size: 48957897
  dataset_size: 122255635
- config_name: openstax
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 669353434
    num_examples: 126332
  download_size: 348325842
  dataset_size: 669353434
- config_name: stanford
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 6228922595
    num_examples: 1000000
  download_size: 3244302299
  dataset_size: 6228922595
- config_name: stories
  features:
  - name: text
    dtype: string
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2140433499
    num_examples: 500000
  download_size: 1188905527
  dataset_size: 2140433499
- config_name: web_samples_v1
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2784556991
    num_examples: 500000
  download_size: 1571554899
  dataset_size: 2784556991
- config_name: web_samples_v2
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 2842778735
    num_examples: 500000
  download_size: 1582617724
  dataset_size: 2842778735
- config_name: wikihow
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: prompt_length
    dtype: int64
  - name: logprobs
    dtype: float32
  splits:
  - name: train
    num_bytes: 894870820
    num_examples: 179191
  download_size: 503175527
  dataset_size: 894870820
configs:
- config_name: auto_math_text
  data_files:
  - split: train
    path: auto_math_text/train-*
- config_name: khanacademy
  data_files:
  - split: train
    path: khanacademy/train-*
- config_name: openstax
  data_files:
  - split: train
    path: openstax/train-*
- config_name: stanford
  data_files:
  - split: train
    path: stanford/train-*
- config_name: stories
  data_files:
  - split: train
    path: stories/train-*
- config_name: web_samples_v1
  data_files:
  - split: train
    path: web_samples_v1/train-*
- config_name: web_samples_v2
  data_files:
  - split: train
    path: web_samples_v2/train-*
- config_name: wikihow
  data_files:
  - split: train
    path: wikihow/train-*
---

The dataset consists of eight subsets, each with its own configuration name. Every subset exposes the same eight features: prompt, text_token_length, text, seed_data, format, audience, prompt_length, and logprobs (log probabilities, stored as float32). The subsets cover math text, Khan Academy, OpenStax textbooks, Stanford, stories, web samples (two versions), and WikiHow. Each subset has a single train split, with its byte sizes and example count listed per subset.
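To make the subset layout concrete, here is a minimal sketch of loading one configuration with the Hugging Face `datasets` library. The repository id and configuration names come from the card above; the helper name `load_subset` and the choice to stream by default are assumptions made for illustration, not something the card prescribes.

```python
# Configuration names copied from the dataset card's YAML header.
CONFIG_NAMES = [
    "auto_math_text", "khanacademy", "openstax", "stanford",
    "stories", "web_samples_v1", "web_samples_v2", "wikihow",
]
REPO_ID = "sanchit-gandhi/cosmopedia-logprobs"


def load_subset(config_name, streaming=True):
    """Load the train split of one subset (hypothetical helper).

    Streaming is assumed as the default here because several subsets
    are multiple gigabytes; pass streaming=False to download fully.
    """
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so the name check above works without the library
    # installed. Requires `pip install datasets` and network access.
    from datasets import load_dataset
    return load_dataset(REPO_ID, config_name, split="train", streaming=streaming)
```

Calling `load_subset("khanacademy")` should return an iterable dataset whose rows carry the eight features listed in the card; `khanacademy` is the smallest subset and a reasonable first one to inspect.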

Provided by:

sanchit-gandhi

Original information summary

Dataset overview

1. auto_math_text

Features:
- prompt: string
- text_token_length: int64
- text: string
- seed_data: string
- format: string
- audience: string
- prompt_length: int64
- logprobs: float32
Splits:
- train: 500000 examples, 2254352700 bytes
Download size: 1142306970 bytes
Dataset size: 2254352700 bytes

2. khanacademy

Features:
- prompt: string
- text_token_length: int64
- text: string
- seed_data: string
- format: string
- audience: string
- prompt_length: int64
- logprobs: float32
Splits:
- train: 24123 examples, 122255635 bytes
Download size: 48957897 bytes
Dataset size: 122255635 bytes

3. openstax

Features:
- text_token_length: int64
- prompt: string
- text: string
- seed_data: string
- format: string
- audience: string
- prompt_length: int64
- logprobs: float32
Splits:
- train: 126332 examples, 669353434 bytes
Download size: 348325842 bytes
Dataset size: 669353434 bytes

4. stanford

Features:
- text_token_length: int64
- prompt: string
- text: string
- seed_data: string
- format: string
- audience: string
- prompt_length: int64
- logprobs: float32
Splits:
- train: 1000000 examples, 6228922595 bytes
Download size: 3244302299 bytes
Dataset size: 6228922595 bytes

5. stories

Features:
- text: string
- prompt: string
- text_token_length: int64
- seed_data: string
- format: string
- audience: string
- prompt_length: int64
- logprobs: float32
Splits:
- train: 500000 examples, 2140433499 bytes
Download size: 1188905527 bytes
Dataset size: 2140433499 bytes

6. web_samples_v1

Features:
- text_token_length: int64
- prompt: string
- text: string
- seed_data: string
- format: string
- audience: string
- prompt_length: int64
- logprobs: float32
Splits:
- train: 500000 examples, 2784556991 bytes
Download size: 1571554899 bytes
Dataset size: 2784556991 bytes

7. web_samples_v2

Features:
- text_token_length: int64
- prompt: string
- text: string
- seed_data: string
- format: string
- audience: string
- prompt_length: int64
- logprobs: float32
Splits:
- train: 500000 examples, 2842778735 bytes
Download size: 1582617724 bytes
Dataset size: 2842778735 bytes

8. wikihow

Features:
- text_token_length: int64
- prompt: string
- text: string
- seed_data: string
- format: string
- audience: string
- prompt_length: int64
- logprobs: float32
Splits:
- train: 179191 examples, 894870820 bytes
Download size: 503175527 bytes
Dataset size: 894870820 bytes
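
The per-subset figures above can be combined into overall totals. The sketch below does only that arithmetic: the numbers are copied from the listing, while the variable names and the "average bytes per example" summary are illustrative choices rather than anything stated by the dataset card.

```python
# (num_examples, dataset_size_bytes, download_size_bytes) per subset,
# copied from the listing above.
SUBSETS = {
    "auto_math_text": (500000, 2254352700, 1142306970),
    "khanacademy": (24123, 122255635, 48957897),
    "openstax": (126332, 669353434, 348325842),
    "stanford": (1000000, 6228922595, 3244302299),
    "stories": (500000, 2140433499, 1188905527),
    "web_samples_v1": (500000, 2784556991, 1571554899),
    "web_samples_v2": (500000, 2842778735, 1582617724),
    "wikihow": (179191, 894870820, 503175527),
}

# Totals across all eight subsets.
total_examples = sum(n for n, _, _ in SUBSETS.values())
total_dataset_bytes = sum(size for _, size, _ in SUBSETS.values())
total_download_bytes = sum(dl for _, _, dl in SUBSETS.values())

# Subset with the most examples.
largest = max(SUBSETS, key=lambda name: SUBSETS[name][0])

# Average uncompressed bytes per example, per subset (illustrative metric).
avg_bytes_per_example = {
    name: size / n for name, (n, size, _) in SUBSETS.items()
}

if __name__ == "__main__":
    print(f"examples: {total_examples:,}")
    print(f"dataset size: {total_dataset_bytes / 1e9:.2f} GB")
    print(f"download size: {total_download_bytes / 1e9:.2f} GB")
    print(f"largest subset: {largest}")
```

The gap between dataset size and download size reflects that the downloaded files are smaller than the loaded data, so planning disk space from the download figure alone would underestimate what a full, non-streaming load needs.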
