kenhktsui/cosmopedia_quality_score_v2
Source: Hugging Face. Updated 2024-05-26; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/kenhktsui/cosmopedia_quality_score_v2
Resource description:
---
dataset_info:
- config_name: auto_math_text
features:
- name: prompt
dtype: string
- name: text_token_length
dtype: int64
- name: text
dtype: string
- name: seed_data
dtype: string
- name: format
dtype: string
- name: audience
dtype: string
- name: quality_score_v2
dtype: float64
splits:
- name: train
num_bytes: 8779811653
num_examples: 1949895
download_size: 4458739426
dataset_size: 8779811653
- config_name: khanacademy
features:
- name: prompt
dtype: string
- name: text_token_length
dtype: int64
- name: text
dtype: string
- name: seed_data
dtype: string
- name: format
dtype: string
- name: audience
dtype: string
- name: quality_score_v2
dtype: float64
splits:
- name: train
num_bytes: 122159143
num_examples: 24123
download_size: 48951116
dataset_size: 122159143
- config_name: openstax
features:
- name: text_token_length
dtype: int64
- name: prompt
dtype: string
- name: text
dtype: string
- name: seed_data
dtype: string
- name: format
dtype: string
- name: audience
dtype: string
- name: quality_score_v2
dtype: float64
splits:
- name: train
num_bytes: 668848106
num_examples: 126332
download_size: 348252139
dataset_size: 668848106
- config_name: stanford
features:
- name: text_token_length
dtype: int64
- name: prompt
dtype: string
- name: text
dtype: string
- name: seed_data
dtype: string
- name: format
dtype: string
- name: audience
dtype: string
- name: quality_score_v2
dtype: float64
splits:
- name: train
num_bytes: 6349451698
num_examples: 1020024
download_size: 3309575750
dataset_size: 6349451698
- config_name: stories
features:
- name: text
dtype: string
- name: prompt
dtype: string
- name: text_token_length
dtype: int64
- name: seed_data
dtype: string
- name: format
dtype: string
- name: audience
dtype: string
- name: quality_score_v2
dtype: float64
splits:
- name: train
num_bytes: 21354683360
num_examples: 4992964
download_size: 11894225941
dataset_size: 21354683360
- config_name: web_samples_v1
features:
- name: text_token_length
dtype: int64
- name: prompt
dtype: string
- name: text
dtype: string
- name: seed_data
dtype: string
- name: format
dtype: string
- name: audience
dtype: string
- name: quality_score_v2
dtype: float64
splits:
- name: train
num_bytes: 69175137079
num_examples: 12426348
download_size: 39069495077
dataset_size: 69175137079
- config_name: web_samples_v2
features:
- name: text_token_length
dtype: int64
- name: prompt
dtype: string
- name: text
dtype: string
- name: seed_data
dtype: string
- name: format
dtype: string
- name: audience
dtype: string
- name: quality_score_v2
dtype: float64
splits:
- name: train
num_bytes: 58794569875
num_examples: 10345867
download_size: 32737483363
dataset_size: 58794569875
- config_name: wikihow
features:
- name: text_token_length
dtype: int64
- name: prompt
dtype: string
- name: text
dtype: string
- name: seed_data
dtype: string
- name: format
dtype: string
- name: audience
dtype: string
- name: quality_score_v2
dtype: float64
splits:
- name: train
num_bytes: 894154056
num_examples: 179191
download_size: 503349411
dataset_size: 894154056
configs:
- config_name: auto_math_text
data_files:
- split: train
path: auto_math_text/train-*
- config_name: khanacademy
data_files:
- split: train
path: khanacademy/train-*
- config_name: openstax
data_files:
- split: train
path: openstax/train-*
- config_name: stanford
data_files:
- split: train
path: stanford/train-*
- config_name: stories
data_files:
- split: train
path: stories/train-*
- config_name: web_samples_v1
data_files:
- split: train
path: web_samples_v1/train-*
- config_name: web_samples_v2
data_files:
- split: train
path: web_samples_v2/train-*
- config_name: wikihow
data_files:
- split: train
path: wikihow/train-*
---
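This dataset adds [quality score v2](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2) to every example of [HuggingFaceTB/cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia). The score is stored in the `quality_score_v2` column (float64) of each config.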
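A minimal sketch of how the `quality_score_v2` column might be used to filter examples. The helper names `filter_by_quality` and `stream_config` are hypothetical, and the `0.5` threshold is an arbitrary placeholder, since the card does not state the score range or a recommended cutoff. Streaming from the Hub assumes the `datasets` library is installed and network access is available; the filter itself works on any iterable of row dictionaries.

```python
from typing import Any, Dict, Iterable, Iterator

REPO_ID = "kenhktsui/cosmopedia_quality_score_v2"


def filter_by_quality(
    rows: Iterable[Dict[str, Any]], min_score: float
) -> Iterator[Dict[str, Any]]:
    """Yield only rows whose quality_score_v2 is at least min_score."""
    for row in rows:
        if row["quality_score_v2"] >= min_score:
            yield row


def stream_config(config_name: str = "khanacademy"):
    """Stream one config from the Hub without downloading it in full.

    Requires `pip install datasets` and network access.
    """
    # Imported lazily so filter_by_quality works without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=True)


if __name__ == "__main__":
    # Toy rows that mimic the card's schema; the values are made up.
    toy_rows = [
        {"prompt": "p1", "text": "t1", "quality_score_v2": 0.2},
        {"prompt": "p2", "text": "t2", "quality_score_v2": 0.9},
    ]
    # 0.5 is an arbitrary illustrative threshold, not a recommended value.
    for row in filter_by_quality(toy_rows, min_score=0.5):
        print(row["prompt"], row["quality_score_v2"])
```

To apply the same filter to real data, pass the result of `stream_config("khanacademy")` as `rows`. `khanacademy` is the smallest config listed on the card, which makes it a reasonable first test.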
Provider:
kenhktsui
Summary of original information
Dataset overview
1. auto_math_text
   - Features:
     - prompt: string
     - text_token_length: integer
     - text: string
     - seed_data: string
     - format: string
     - audience: string
     - quality_score_v2: float
   - Splits:
     - train: 1949895 examples, 8779811653 bytes
   - Download size: 4458739426 bytes
   - Dataset size: 8779811653 bytes
2. khanacademy
   - Features:
     - prompt: string
     - text_token_length: integer
     - text: string
     - seed_data: string
     - format: string
     - audience: string
     - quality_score_v2: float
   - Splits:
     - train: 24123 examples, 122159143 bytes
   - Download size: 48951116 bytes
   - Dataset size: 122159143 bytes
3. openstax
   - Features:
     - text_token_length: integer
     - prompt: string
     - text: string
     - seed_data: string
     - format: string
     - audience: string
     - quality_score_v2: float
   - Splits:
     - train: 126332 examples, 668848106 bytes
   - Download size: 348252139 bytes
   - Dataset size: 668848106 bytes
4. stanford
   - Features:
     - text_token_length: integer
     - prompt: string
     - text: string
     - seed_data: string
     - format: string
     - audience: string
     - quality_score_v2: float
   - Splits:
     - train: 1020024 examples, 6349451698 bytes
   - Download size: 3309575750 bytes
   - Dataset size: 6349451698 bytes
5. stories
   - Features:
     - text: string
     - prompt: string
     - text_token_length: integer
     - seed_data: string
     - format: string
     - audience: string
     - quality_score_v2: float
   - Splits:
     - train: 4992964 examples, 21354683360 bytes
   - Download size: 11894225941 bytes
   - Dataset size: 21354683360 bytes
6. web_samples_v1
   - Features:
     - text_token_length: integer
     - prompt: string
     - text: string
     - seed_data: string
     - format: string
     - audience: string
     - quality_score_v2: float
   - Splits:
     - train: 12426348 examples, 69175137079 bytes
   - Download size: 39069495077 bytes
   - Dataset size: 69175137079 bytes
7. web_samples_v2
   - Features:
     - text_token_length: integer
     - prompt: string
     - text: string
     - seed_data: string
     - format: string
     - audience: string
     - quality_score_v2: float
   - Splits:
     - train: 10345867 examples, 58794569875 bytes
   - Download size: 32737483363 bytes
   - Dataset size: 58794569875 bytes
8. wikihow
   - Features:
     - text_token_length: integer
     - prompt: string
     - text: string
     - seed_data: string
     - format: string
     - audience: string
     - quality_score_v2: float
   - Splits:
     - train: 179191 examples, 894154056 bytes
   - Download size: 503349411 bytes
   - Dataset size: 894154056 bytes
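The per-config figures above can be aggregated to estimate the cost of downloading everything. The sketch below copies the numbers from the card; the names `CONFIG_STATS`, `summarize`, and `share_of_examples` are hypothetical, and sizes are converted to decimal gigabytes (1 GB = 10^9 bytes) as an assumption about the preferred unit.

```python
from typing import Dict, Tuple

# (num_examples, download_size_bytes, dataset_size_bytes), copied from the card.
CONFIG_STATS: Dict[str, Tuple[int, int, int]] = {
    "auto_math_text": (1949895, 4458739426, 8779811653),
    "khanacademy": (24123, 48951116, 122159143),
    "openstax": (126332, 348252139, 668848106),
    "stanford": (1020024, 3309575750, 6349451698),
    "stories": (4992964, 11894225941, 21354683360),
    "web_samples_v1": (12426348, 39069495077, 69175137079),
    "web_samples_v2": (10345867, 32737483363, 58794569875),
    "wikihow": (179191, 503349411, 894154056),
}


def summarize(stats: Dict[str, Tuple[int, int, int]]) -> Tuple[int, int, int]:
    """Return (total examples, total download bytes, total dataset bytes)."""
    total_examples = sum(n for n, _, _ in stats.values())
    total_download = sum(d for _, d, _ in stats.values())
    total_size = sum(s for _, _, s in stats.values())
    return total_examples, total_download, total_size


def share_of_examples(stats: Dict[str, Tuple[int, int, int]]) -> Dict[str, float]:
    """Return each config's fraction of the total example count."""
    total = sum(n for n, _, _ in stats.values())
    return {name: n / total for name, (n, _, _) in stats.items()}


if __name__ == "__main__":
    examples, download, size = summarize(CONFIG_STATS)
    print(f"examples: {examples:,}")
    print(f"download: {download / 1e9:.1f} GB, on disk: {size / 1e9:.1f} GB")
    shares = share_of_examples(CONFIG_STATS)
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:>16}: {share:.1%}")
```

The two `web_samples` configs account for most of the examples and bytes, so loading a single smaller config such as `khanacademy` or `openstax` is a cheaper way to inspect the schema first.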



