vikp/textbook_quality_programming
Source: Hugging Face. Updated 2023-10-08. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/vikp/textbook_quality_programming
Resource description:
---
language:
- en
dataset_info:
  features:
  - name: topic
    dtype: string
  - name: model
    dtype: string
  - name: concepts
    sequence: string
  - name: outline
    sequence: string
  - name: markdown
    dtype: string
  splits:
  - name: train
    num_bytes: 471931604
    num_examples: 11650
  download_size: 0
  dataset_size: 471931604
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
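The feature list in the metadata above can be turned into a quick sanity check on individual rows. The following is a minimal sketch, not part of the dataset's own tooling: the names `EXPECTED_FEATURES`, `SEQUENCE_COLUMNS`, and `schema_errors` are hypothetical, and it assumes that a `sequence: string` column arrives in Python as a list of strings.

```python
# Expected columns, transcribed from the `dataset_info.features` metadata.
# Assumption: "sequence: string" columns are materialised as lists of str.
EXPECTED_FEATURES = {
    "topic": str,
    "model": str,
    "concepts": list,
    "outline": list,
    "markdown": str,
}
SEQUENCE_COLUMNS = {"concepts", "outline"}


def schema_errors(row: dict) -> list:
    """Return human-readable schema problems for one row (empty if none)."""
    errors = []
    for name, expected_type in EXPECTED_FEATURES.items():
        if name not in row:
            errors.append(f"missing column: {name}")
            continue
        value = row[name]
        if not isinstance(value, expected_type):
            errors.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif name in SEQUENCE_COLUMNS and not all(
            isinstance(item, str) for item in value
        ):
            errors.append(f"{name}: every element should be a string")
    return errors
```

Rows to check could come from the Hugging Face `datasets` package, for example `load_dataset("vikp/textbook_quality_programming", split="train")`, assuming that package is installed and the Hub is reachable.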
# Dataset Card for "textbook_quality_programming"
Synthetic programming textbooks generated with GPT-3.5 and retrieval. Very high quality, and aimed at use in a phi replication. The dataset currently contains 115M tokens. It covers many languages and technologies, with a bias towards Python.
~10k of the books (65M tokens) use an older generation method, and average 6k tokens in length. ~1.5k books (50M tokens) use a newer generation method, with a more detailed outline, and average 33k tokens in length. All books have section headers for optimal chunking.
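Because every book carries section headers, the `markdown` field can be chunked along them. The sketch below shows one possible way to do that. It assumes the headers are ATX-style markdown (`#`, `##`, ...), which the card does not state explicitly. The names `HEADER_RE` and `split_by_headers` are hypothetical.

```python
import re

# Matches ATX-style markdown headers such as "# Title" or "## 1.1 Section".
HEADER_RE = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def split_by_headers(markdown: str) -> list:
    """Split a markdown document into (header, body) chunks.

    Text before the first header is returned under an empty header.
    Lines inside fenced code blocks are never treated as headers, since
    "# comment" lines are common in programming textbooks.
    """
    chunks = []
    header, body, in_fence = "", [], False
    for line in markdown.splitlines():
        # Toggle fence state on every line that opens or closes a code fence.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            # Flush the chunk collected so far, skipping an empty preamble.
            if header or any(part.strip() for part in body):
                chunks.append((header, "\n".join(body).strip()))
            header, body = match.group(1), []
        else:
            body.append(line)
    if header or any(part.strip() for part in body):
        chunks.append((header, "\n".join(body).strip()))
    return chunks
```

Splitting on headers rather than at fixed token counts keeps each chunk aligned with one section of a book. Sections that are still too long for a given context window would need a secondary split.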
Generated using the [textbook_quality](https://github.com/VikParuchuri/textbook_quality) repo.
Provider:
vikp
Summary of original information
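Dataset overview
Dataset name
- Name: textbook_quality_programming
Dataset description
- Generation method: synthetic programming textbooks generated with GPT-3.5 and retrieval.
- Quality goal: very high, aimed at use in a phi replication.
- Size: currently 115M tokens.
- Coverage: many languages and technologies, with a bias towards Python.
Dataset features
- Feature list:
  - topic: string
  - model: string
  - concepts: sequence of strings
  - outline: sequence of strings
  - markdown: string
Dataset splits
- Split details:
  - train:
    - Number of examples: 11650
    - Data size: 471931604 bytes
Dataset size
- Download size: 0 bytes
- Total dataset size: 471931604 bytes
Data file configuration
- Config name: default
- Data file paths:
  - train: data/train-*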
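
The size figures above can be combined into rough per-book averages. This is a back-of-the-envelope sketch: the byte and example counts are copied from the metadata, while the token total is the card's rounded "115M" figure, so the derived values are approximate. The variable names are illustrative only.

```python
# Figures copied from the dataset card.
NUM_EXAMPLES = 11_650
DATASET_BYTES = 471_931_604
TOTAL_TOKENS = 115_000_000  # rounded figure from the card, not an exact count

# Derived averages across the whole train split.
bytes_per_example = DATASET_BYTES / NUM_EXAMPLES
tokens_per_example = TOTAL_TOKENS / NUM_EXAMPLES
bytes_per_token = DATASET_BYTES / TOTAL_TOKENS

print(f"{bytes_per_example:,.0f} bytes per book on average")
print(f"{tokens_per_example:,.0f} tokens per book on average")
print(f"{bytes_per_token:.2f} bytes per token on average")
```

The overall average hides the two generation methods the card describes, which average about 6k and 33k tokens per book, so it should not be read as a typical book length.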



