jackeyhug/jackeydataset
Source: Hugging Face. Updated 2024-03-17; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/jackeyhug/jackeydataset
Resource description:
dataset_info:
- config_name: auto_math_text
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_bytes: 8777587297.907892
    num_examples: 1949895
  download_size: 4461401898
  dataset_size: 8777587297.907892
- config_name: khanacademy
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_bytes: 108591354.09210858
    num_examples: 24123
  download_size: 49139761
  dataset_size: 108591354.09210858
- config_name: openstax
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_bytes: 667837450
    num_examples: 126332
  download_size: 346992522
  dataset_size: 667837450
- config_name: stanford
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_bytes: 6341291506
    num_examples: 1020024
  download_size: 3302284560
  dataset_size: 6341291506
- config_name: stories
  features:
  - name: text
    dtype: string
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_bytes: 21314739648
    num_examples: 4992964
  download_size: 11902294709
  dataset_size: 21314739648
- config_name: web_samples_v1
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_bytes: 69075726295
    num_examples: 12426348
  download_size: 38978124936
  dataset_size: 69075726295
- config_name: web_samples_v2
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_bytes: 58711802939
    num_examples: 10345867
  download_size: 32658254617
  dataset_size: 58711802939
- config_name: wikihow
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  splits:
  - name: train
    num_bytes: 892720528
    num_examples: 179191
  download_size: 502284600
  dataset_size: 892720528
configs:
- config_name: auto_math_text
  data_files:
  - split: train
    path: data/auto_math_text/train-*
- config_name: khanacademy
  data_files:
  - split: train
    path: data/khanacademy/train-*
- config_name: openstax
  data_files:
  - split: train
    path: data/openstax/train-*
- config_name: stanford
  data_files:
  - split: train
    path: data/stanford/train-*
- config_name: stories
  data_files:
  - split: train
    path: data/stories/train-*
- config_name: web_samples_v1
  data_files:
  - split: train
    path: data/web_samples_v1/train-*
- config_name: web_samples_v2
  data_files:
  - split: train
    path: data/web_samples_v2/train-*
- config_name: wikihow
  data_files:
  - split: train
    path: data/wikihow/train-*
license: apache-2.0
language:
- en
tags:
- synthetic
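Every configuration in the metadata above declares the same six features, only in a different order. A minimal sketch of checking one record against that schema, assuming records arrive as plain Python dicts. The `validate_record` helper and the sample record are hypothetical, and mapping `string` to `str` and `int64` to `int` is an assumption about how the values surface in Python:

```python
# Feature names and dtypes as declared under dataset_info.
# The Python types are an assumed mapping: string -> str, int64 -> int.
FEATURES = {
    "prompt": str,
    "text_token_length": int,
    "text": str,
    "seed_data": str,
    "format": str,
    "audience": str,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if it conforms)."""
    problems = []
    for name, expected in FEATURES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    for name in record:
        if name not in FEATURES:
            problems.append(f"unexpected field: {name}")
    return problems


# Hypothetical record; every value here is made up for illustration.
sample = {
    "prompt": "Write a short lesson about fractions.",
    "text_token_length": 12,
    "text": "A fraction describes a part of a whole.",
    "seed_data": "khanacademy",
    "format": "textbook",
    "audience": "general",
}
print(validate_record(sample))
```

Because the check is keyed by field name, it is indifferent to the column order, which is the only thing that varies between the eight configurations.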
Provider:
jackeyhug
Summary of original information
Dataset overview
Dataset configurations
1. auto_math_text
- Features:
prompt: string, text_token_length: int64, text: string, seed_data: string, format: string, audience: string
- Splits:
train: num_bytes: 8777587297.907892, num_examples: 1949895
- Download size: 4461401898 bytes
- Dataset size: 8777587297.907892 bytes
2. khanacademy
- Features:
prompt: string, text_token_length: int64, text: string, seed_data: string, format: string, audience: string
- Splits:
train: num_bytes: 108591354.09210858, num_examples: 24123
- Download size: 49139761 bytes
- Dataset size: 108591354.09210858 bytes
3. openstax
- Features:
text_token_length: int64, prompt: string, text: string, seed_data: string, format: string, audience: string
- Splits:
train: num_bytes: 667837450, num_examples: 126332
- Download size: 346992522 bytes
- Dataset size: 667837450 bytes
4. stanford
- Features:
text_token_length: int64, prompt: string, text: string, seed_data: string, format: string, audience: string
- Splits:
train: num_bytes: 6341291506, num_examples: 1020024
- Download size: 3302284560 bytes
- Dataset size: 6341291506 bytes
5. stories
- Features:
text: string, prompt: string, text_token_length: int64, seed_data: string, format: string, audience: string
- Splits:
train: num_bytes: 21314739648, num_examples: 4992964
- Download size: 11902294709 bytes
- Dataset size: 21314739648 bytes
6. web_samples_v1
- Features:
text_token_length: int64, prompt: string, text: string, seed_data: string, format: string, audience: string
- Splits:
train: num_bytes: 69075726295, num_examples: 12426348
- Download size: 38978124936 bytes
- Dataset size: 69075726295 bytes
7. web_samples_v2
- Features:
text_token_length: int64, prompt: string, text: string, seed_data: string, format: string, audience: string
- Splits:
train: num_bytes: 58711802939, num_examples: 10345867
- Download size: 32658254617 bytes
- Dataset size: 58711802939 bytes
8. wikihow
- Features:
text_token_length: int64, prompt: string, text: string, seed_data: string, format: string, audience: string
- Splits:
train: num_bytes: 892720528, num_examples: 179191
- Download size: 502284600 bytes
- Dataset size: 892720528 bytes
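The per-configuration figures are easier to compare once reduced to a few derived numbers. The sketch below copies the example counts and sizes from the listing and computes totals, average bytes per example, and the ratio of download size to dataset size. The `summarize` helper and its output keys are illustrative names, not part of the dataset card:

```python
# (num_examples, download_size, dataset_size) per configuration,
# copied from the listing; sizes are in bytes.
CONFIGS = {
    "auto_math_text": (1949895, 4461401898, 8777587297.907892),
    "khanacademy": (24123, 49139761, 108591354.09210858),
    "openstax": (126332, 346992522, 667837450),
    "stanford": (1020024, 3302284560, 6341291506),
    "stories": (4992964, 11902294709, 21314739648),
    "web_samples_v1": (12426348, 38978124936, 69075726295),
    "web_samples_v2": (10345867, 32658254617, 58711802939),
    "wikihow": (179191, 502284600, 892720528),
}


def summarize(configs):
    """Derive per-config averages and download/dataset size ratios."""
    rows = []
    for name, (num_examples, download_size, dataset_size) in configs.items():
        rows.append({
            "config": name,
            "bytes_per_example": dataset_size / num_examples,
            "download_ratio": download_size / dataset_size,
        })
    return rows


total_examples = sum(n for n, _, _ in CONFIGS.values())
total_download = sum(dl for _, dl, _ in CONFIGS.values())

print(f"total examples: {total_examples}")
print(f"total download: {total_download / 1e9:.1f} GB")
for row in summarize(CONFIGS):
    print(f"{row['config']:<16} {row['bytes_per_example']:>8.0f} B/example  "
          f"download/dataset = {row['download_ratio']:.2f}")
```

The download size is smaller than the dataset size for every configuration, which is worth knowing when budgeting disk space for both the downloaded files and the prepared data.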
Data file paths
- auto_math_text:
train: data/auto_math_text/train-*
- khanacademy:
train: data/khanacademy/train-*
- openstax:
train: data/openstax/train-*
- stanford:
train: data/stanford/train-*
- stories:
train: data/stories/train-*
- web_samples_v1:
train: data/web_samples_v1/train-*
- web_samples_v2:
train: data/web_samples_v2/train-*
- wikihow:
train: data/wikihow/train-*
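Each path above is a glob pattern rather than a single file, so one configuration can span many shard files. A rough sketch of how such patterns select files, using Python's `fnmatch` as a stand-in; this is not the resolver the Hugging Face hub itself uses, and the shard filenames below are hypothetical, since the listing does not show the actual file names:

```python
from fnmatch import fnmatch

# Glob patterns for the train split, copied from the listing.
PATTERNS = {
    "auto_math_text": "data/auto_math_text/train-*",
    "khanacademy": "data/khanacademy/train-*",
    "openstax": "data/openstax/train-*",
    "stanford": "data/stanford/train-*",
    "stories": "data/stories/train-*",
    "web_samples_v1": "data/web_samples_v1/train-*",
    "web_samples_v2": "data/web_samples_v2/train-*",
    "wikihow": "data/wikihow/train-*",
}


def config_for(path):
    """Return the configuration whose train pattern matches path, else None."""
    for name, pattern in PATTERNS.items():
        # Note: fnmatch lets "*" match "/" as well, unlike a shell glob.
        if fnmatch(path, pattern):
            return name
    return None


# Hypothetical shard names, for illustration only.
for path in [
    "data/stories/train-00000-of-00043.parquet",
    "data/web_samples_v1/train-00001-of-00139.parquet",
    "data/stories/test-00000-of-00001.parquet",
]:
    print(path, "->", config_for(path))
```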
License
apache-2.0
Language
en
Tags
synthetic



