kenhktsui/cosmopedia_quality_score_v1v2
Source: Hugging Face. Updated 2024-05-28; indexed 2024-06-12.
Download link:
https://hf-mirror.com/datasets/kenhktsui/cosmopedia_quality_score_v1v2
Resource description:
---
dataset_info:
- config_name: auto_math_text
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: quality_score_v2
    dtype: float64
  - name: quality_score_v1
    struct:
    - name: label
      dtype: string
    - name: score
      dtype: float32
  splits:
  - name: train
    num_bytes: 8818346409
    num_examples: 1949895
  download_size: 4470498060
  dataset_size: 8818346409
- config_name: khanacademy
  features:
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: quality_score_v2
    dtype: float64
  - name: quality_score_v1
    struct:
    - name: label
      dtype: string
    - name: score
      dtype: float32
  splits:
  - name: train
    num_bytes: 122638384
    num_examples: 24123
  download_size: 49096546
  dataset_size: 122638384
- config_name: openstax
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: quality_score_v2
    dtype: float64
  - name: quality_score_v1
    struct:
    - name: label
      dtype: string
    - name: score
      dtype: float32
  splits:
  - name: train
    num_bytes: 671345130
    num_examples: 126332
  download_size: 349014722
  dataset_size: 671345130
- config_name: stanford
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: quality_score_v2
    dtype: float64
  - name: quality_score_v1
    struct:
    - name: label
      dtype: string
    - name: score
      dtype: float32
  splits:
  - name: train
    num_bytes: 6369731642
    num_examples: 1020024
  download_size: 3315737531
  dataset_size: 6369731642
- config_name: stories
  features:
  - name: text
    dtype: string
  - name: prompt
    dtype: string
  - name: text_token_length
    dtype: int64
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: quality_score_v2
    dtype: float64
  - name: quality_score_v1
    struct:
    - name: label
      dtype: string
    - name: score
      dtype: float32
  splits:
  - name: train
    num_bytes: 21452301637
    num_examples: 4992964
  download_size: 11924260292
  dataset_size: 21452301637
- config_name: web_samples_v1
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: quality_score_v2
    dtype: float64
  - name: quality_score_v1
    struct:
    - name: label
      dtype: string
    - name: score
      dtype: float32
  splits:
  - name: train
    num_bytes: 69420017340
    num_examples: 12426348
  download_size: 39144367823
  dataset_size: 69420017340
- config_name: web_samples_v2
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: quality_score_v2
    dtype: float64
  - name: quality_score_v1
    struct:
    - name: label
      dtype: string
    - name: score
      dtype: float32
  splits:
  - name: train
    num_bytes: 58998735831
    num_examples: 10345867
  download_size: 32799841057
  dataset_size: 58998735831
- config_name: wikihow
  features:
  - name: text_token_length
    dtype: int64
  - name: prompt
    dtype: string
  - name: text
    dtype: string
  - name: seed_data
    dtype: string
  - name: format
    dtype: string
  - name: audience
    dtype: string
  - name: quality_score_v2
    dtype: float64
  - name: quality_score_v1
    struct:
    - name: label
      dtype: string
    - name: score
      dtype: float32
  splits:
  - name: train
    num_bytes: 897695890
    num_examples: 179191
  download_size: 504430220
  dataset_size: 897695890
configs:
- config_name: auto_math_text
  data_files:
  - split: train
    path: auto_math_text/train-*
- config_name: khanacademy
  data_files:
  - split: train
    path: khanacademy/train-*
- config_name: openstax
  data_files:
  - split: train
    path: openstax/train-*
- config_name: stanford
  data_files:
  - split: train
    path: stanford/train-*
- config_name: stories
  data_files:
  - split: train
    path: stories/train-*
- config_name: web_samples_v1
  data_files:
  - split: train
    path: web_samples_v1/train-*
- config_name: web_samples_v2
  data_files:
  - split: train
    path: web_samples_v2/train-*
- config_name: wikihow
  data_files:
  - split: train
    path: wikihow/train-*
---
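The dataset comprises eight configurations: auto_math_text, khanacademy, openstax, stanford, stories, web_samples_v1, web_samples_v2, and wikihow. All of them share the same features: prompt, text, text_token_length, seed_data, format, audience, and two quality scores (quality_score_v1, a struct holding a label and a score, and quality_score_v2, a float). Each configuration has a single train split whose byte size and example count are given in the metadata. The dataset is primarily intended for text-related machine learning tasks.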
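As a rough illustration of how the two score columns might be used, the sketch below filters rows by `quality_score_v2` and, optionally, by the `label` inside `quality_score_v1`. The helper names (`keep_row`, `filter_rows`, `stream_config`) are hypothetical, and the source does not document the score range or the label values, so any threshold or label passed in is an assumption to be checked against real rows. Streaming is used because several configurations are tens of gigabytes.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

REPO_ID = "kenhktsui/cosmopedia_quality_score_v1v2"


def keep_row(row: Dict[str, Any], min_v2: float, v1_label: Optional[str] = None) -> bool:
    """Return True if a row passes the (assumed) quality filters.

    min_v2 is a caller-chosen threshold; the valid range of
    quality_score_v2 is not documented in the dataset card.
    """
    if row["quality_score_v2"] < min_v2:
        return False
    # quality_score_v1 is a struct with `label` (string) and `score` (float32).
    if v1_label is not None and row["quality_score_v1"]["label"] != v1_label:
        return False
    return True


def filter_rows(
    rows: Iterable[Dict[str, Any]], min_v2: float, v1_label: Optional[str] = None
) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the rows accepted by keep_row."""
    return (row for row in rows if keep_row(row, min_v2, v1_label))


def stream_config(config_name: str):
    """Stream the train split of one configuration without a full download."""
    # Imported lazily so the filters above work without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train", streaming=True)
```

A call such as `filter_rows(stream_config("khanacademy"), min_v2=0.5)` would then iterate over the smallest configuration; the value `0.5` is only a placeholder.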
Provider:
kenhktsui
Summary of original information
Dataset overview
1. auto_math_text
- Features:
  - prompt: string
  - text_token_length: integer
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - quality_score_v2: float
  - quality_score_v1: struct (label: string, score: float)
- Splits:
  - train: 1949895 examples, 8818346409 bytes
- Download size: 4470498060 bytes
- Dataset size: 8818346409 bytes
2. khanacademy
- Features:
  - prompt: string
  - text_token_length: integer
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - quality_score_v2: float
  - quality_score_v1: struct (label: string, score: float)
- Splits:
  - train: 24123 examples, 122638384 bytes
- Download size: 49096546 bytes
- Dataset size: 122638384 bytes
3. openstax
- Features:
  - text_token_length: integer
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - quality_score_v2: float
  - quality_score_v1: struct (label: string, score: float)
- Splits:
  - train: 126332 examples, 671345130 bytes
- Download size: 349014722 bytes
- Dataset size: 671345130 bytes
4. stanford
- Features:
  - text_token_length: integer
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - quality_score_v2: float
  - quality_score_v1: struct (label: string, score: float)
- Splits:
  - train: 1020024 examples, 6369731642 bytes
- Download size: 3315737531 bytes
- Dataset size: 6369731642 bytes
5. stories
- Features:
  - text: string
  - prompt: string
  - text_token_length: integer
  - seed_data: string
  - format: string
  - audience: string
  - quality_score_v2: float
  - quality_score_v1: struct (label: string, score: float)
- Splits:
  - train: 4992964 examples, 21452301637 bytes
- Download size: 11924260292 bytes
- Dataset size: 21452301637 bytes
6. web_samples_v1
- Features:
  - text_token_length: integer
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - quality_score_v2: float
  - quality_score_v1: struct (label: string, score: float)
- Splits:
  - train: 12426348 examples, 69420017340 bytes
- Download size: 39144367823 bytes
- Dataset size: 69420017340 bytes
7. web_samples_v2
- Features:
  - text_token_length: integer
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - quality_score_v2: float
  - quality_score_v1: struct (label: string, score: float)
- Splits:
  - train: 10345867 examples, 58998735831 bytes
- Download size: 32799841057 bytes
- Dataset size: 58998735831 bytes
8. wikihow
- Features:
  - text_token_length: integer
  - prompt: string
  - text: string
  - seed_data: string
  - format: string
  - audience: string
  - quality_score_v2: float
  - quality_score_v1: struct (label: string, score: float)
- Splits:
  - train: 179191 examples, 897695890 bytes
- Download size: 504430220 bytes
- Dataset size: 897695890 bytes
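The per-configuration figures in this overview can be cross-checked with a little arithmetic. The sketch below copies the example counts and byte sizes from the list into a table, sums the examples, and computes the ratio of download size to dataset size for each configuration. The names `STATS`, `total_examples`, and `download_ratio` are illustrative and not part of the dataset.

```python
from typing import Dict, Tuple

# (num_examples, dataset_size_bytes, download_size_bytes), copied from the overview.
STATS: Dict[str, Tuple[int, int, int]] = {
    "auto_math_text": (1949895, 8818346409, 4470498060),
    "khanacademy": (24123, 122638384, 49096546),
    "openstax": (126332, 671345130, 349014722),
    "stanford": (1020024, 6369731642, 3315737531),
    "stories": (4992964, 21452301637, 11924260292),
    "web_samples_v1": (12426348, 69420017340, 39144367823),
    "web_samples_v2": (10345867, 58998735831, 32799841057),
    "wikihow": (179191, 897695890, 504430220),
}


def total_examples(stats: Dict[str, Tuple[int, int, int]]) -> int:
    """Sum the train-split example counts over all configurations."""
    return sum(num_examples for num_examples, _, _ in stats.values())


def download_ratio(stats: Dict[str, Tuple[int, int, int]]) -> Dict[str, float]:
    """Download size divided by dataset size, per configuration."""
    return {name: dl / size for name, (_, size, dl) in stats.items()}


# Example usage: print the total and each configuration's ratio.
# print(total_examples(STATS))
# for name, ratio in download_ratio(STATS).items():
#     print(f"{name}: {ratio:.2f}")
```

The total comes to 31064744 examples, and the two web_samples configurations together account for most of them.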
Collected summary
Dataset introduction

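Background and challenges
Background overview
This is a large-scale dataset of educational text with quality scores. It contains more than 31 million rows drawn from several sources, including auto-generated math text, Khan Academy, and WikiHow. Its key characteristic is that every text carries two versions of a quality score (quality_score_v1 and quality_score_v2), which are used to assess the text's suitability and educational value, with texts aimed at different audiences (for example, primary-school pupils and college students).
The content above was collected and summarized by 遇见数据集.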
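Because each row records both an `audience` and a `quality_score_v2`, one simple way to explore the data is to average the score per audience. The sketch below does this over any iterable of row dictionaries. The function name `mean_score_by_audience` is hypothetical, and the audience strings in the comment are placeholders, since the source does not list the actual values.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def mean_score_by_audience(rows: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Average quality_score_v2 for each distinct value of `audience`."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for row in rows:
        audience = row["audience"]
        sums[audience] += row["quality_score_v2"]
        counts[audience] += 1
    return {audience: sums[audience] / counts[audience] for audience in sums}


# Example with made-up rows (audience names are placeholders):
# mean_score_by_audience([
#     {"audience": "audience_a", "quality_score_v2": 0.25},
#     {"audience": "audience_a", "quality_score_v2": 0.75},
# ])
```

The function makes a single pass and keeps only running sums, so it can be fed a streamed configuration without holding rows in memory.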



