---
title: 'Phi-1 Model Dataset'
date: '2023-07-03'
license: cc-by-nc-sa-3.0
---
## Dataset Description
- **Homepage:** [teleprint.me](https://teleprint.me)
- **Repository:** [phi-1](https://huggingface.co/datasets/teleprint-me/phi-1)
- **Paper:** [2306.11644v1](https://arxiv.org/abs/2306.11644v1)
- **Leaderboard:** [Link to the leaderboard]
- **Point of Contact:** [aberrio@teleprint.me](aberrio@teleprint.me)
### Dataset Summary
This dataset is created for training the phi-1 model, based on the paper
"Textbooks are All You Need". It contains high-quality data derived from various
textbooks, transformed and synthesized using OpenAI's GPT-3.5 and GPT-4 models.
For optimal results, it is recommended to train models with the following
parameters and sequence lengths:
- For a model with 350M parameters, use a sequence length of 2048.
- For a model with 700M parameters, use a sequence length of 4096.
- For a model with 1.3B parameters, use a sequence length of 8096.
Please note that the dataset is currently in its initial phase of planning and
collection. The process involves preparing the data, extracting it, formatting
it, chunking it, and preparing it for synthesis. Scripts for preparing and
processing the data for the model will be developed. Once the data is generated,
it will undergo a review and revision process to ensure its quality and
relevance.
These recommendations and notes are based on the dataset creator's initial plans
and may be subject to change as the project progresses.
**NOTE**: Due to the nature of this dataset, it cannot be released without
obtaining permissions from the respective publishers and/or authors. If you are
an author or publisher and have any concerns about this repository, please feel
free to email me.
If you are an author or publisher and would like to grant permission for the use
of your work, your support would be greatly appreciated. Please note that in
order for the dataset to be released, permissions would need to be unanimous
from all involved parties.
In the absence of such permissions, I will respect the copyrights of the
copyrighted materials and exercise my right to Fair Use with my own physical
property for personal use.
**This dataset is NOT intended for commercial purposes**. Its primary purpose is
for research in machine learning and AI software development. If a model is
created using this dataset, it will be shared under the same license.
Any proceeds derived from donations will be primarily used for the development
of the dataset and the model.
### Supported Tasks and Leaderboards
- `text-generation`: The dataset can be used to train a model for chat-like text
generation, more specifically, for generating explanations and examples in the
context of arithmetic, algebra, geometry, trigonometry, calculus, algorithms
and data structures, design patterns, and the python programming language.
### Languages
The text in the dataset is in English.
## Dataset Structure
### Data Instances
A data instance consists of a dialogue between a user and an assistant,
discussing a topic in arithmetic, algebra, geometry, trigonometry, calculus,
algorithms and data structures, design patterns, or the Python programming
language. The dialogue is structured as a list of turns, each turn containing
the role ("user" or "assistant") and the content of the turn.
### Data Fields
- `role`: a string indicating the role of the speaker in the dialogue ("system",
"user", "assistant", "function").
- `content`: a string containing the content of the speaker's turn in the
dialogue.
### Data Splits
The dataset is split into a training set, a validation set, and a test set. The
exact sizes and proportions of these splits will depend on the final size of the
dataset.
## Dataset Creation
### Curation Rationale
The dataset is being created to train a model capable of generating explanations
and examples in the context of various mathematical and computer science topics.
The goal is to create an AI assistant that can provide clear, accurate, and
pedagogically sound responses to user queries on these topics.
### Source Data
#### Initial Data Collection and Normalization
The data is collected from a variety of textbooks covering arithmetic, algebra,
geometry, trigonometry, calculus, algorithms and data structures, design
patterns, and the Python programming language. The textbooks used include:
- Barron's Arithmetic The Easy Way Fourth Edition
- Blitzer Introductory Algebra for College Students Fifth Edition
- McDougal Littell Geometry
- Blitzer Intermediate Algebra for College Students 5th Edition
- Trigonometry Sixth Edition
- Pearson College Algebra Fourth Edition
- Hughes-Hallet Applied Calculus 5th Edition
- CLRS Introduction to Algorithms Third Edition
In addition to the textbooks, the dataset also includes material from the
following online resources:
- [C reference](https://en.cppreference.com/w/c)
- [Cpp reference](https://en.cppreference.com/w/cpp)
- [Python Standard Library](https://docs.python.org/3/)
These resources provide up-to-date information and examples for the C, C++, and
Python programming languages. The creators of the Cppreference site also provide
[archives](https://en.cppreference.com/w/Cppreference:Archives) of their site
for offline use. Code samples synthesized by OpenAI's GPT models, curated by the
dataset creator, are also included in the dataset.
**Note:** The creator of this dataset owns physical copies of all the textbooks
listed above. The data from these sources are transformed into a dialogue format
using OpenAI's GPT-3.5 and GPT-4 models. The resulting dialogues are then used
as the training data for the phi-1 model. This dataset does not include the full
content of the source textbooks. Instead, it consists of transformations and
syntheses of the original content. Anyone who wants access to the full original
content should purchase or otherwise legally access the textbooks themselves.
#### Who are the source language producers?
The original language data was created by a variety of authors and educators,
who wrote the textbooks and other materials used as sources for this dataset.
These include:
- Barron's Arithmetic The Easy Way Fourth Edition - Edward Williams, Katie
Prindle
- Blitzer Introductory Algebra for College Students Fifth Edition - Robert
Blitzer
- McDougal Littell Geometry - Ron Larson, Laurie Boswell, Timothy D. Kanold, Lee
Stiff
- Blitzer Intermediate Algebra for College Students 5th Edition - Robert Blitzer
- Trigonometry Sixth Edition - Charles P. McKeague, Mark D. Turner
- Pearson College Algebra Fourth Edition - Robert F. Blitzer
- Hughes-Hallet Applied Calculus 5th Edition - Deborah Hughes-Hallett, Andrew M.
Gleason, Patti Frazer Lock, Daniel E. Flath, Sheldon P. Gordon, David O.
Lomen, David Lovelock, William G. McCallum, Brad G. Osgood, Andrew Pasquale,
Jeff Tecosky-Feldman, Joseph Thrash, Karen R. Rhea, Thomas W. Tucker
- CLRS Introduction to Algorithms Third Edition - Thomas H. Cormen, Charles E.
Leiserson, Ronald L. Rivest, Clifford Stein
In addition to these authors, the developers of OpenAI's GPT-3.5 and GPT-4
models also contributed to the creation of the language data, as these models
were used to transform the source material into a dialogue format.
### Annotations
#### Annotation process
The dataset does not contain any explicit annotations. However, the data is
curated and synthesized using OpenAI's GPT-3.5 and GPT-4 models. The process
involves transforming the source material into a dialogue format suitable for
training the phi-1 model. The dataset creator, an independent learner with a
strong interest in computer science, reviewed and curated the synthesized
dialogues to ensure their quality and relevance.
#### Who are the annotators?
The dataset creator, an independent learner who has studied computer science
extensively in a self-directed manner, performed the curation and review of the
synthesized dialogues.
### Personal and Sensitive Information
The dataset does not contain any personal or sensitive information. All the data
is derived from publicly available textbooks and online resources. Any names or
other potential identifiers in the source material have been removed or
anonymized.
### Social Impact of Dataset
The dataset is intended to support the development of AI models capable of
providing detailed explanations and examples in the context of arithmetic,
algebra, geometry, trigonometry, calculus, algorithms and data structures,
design patterns, and the python programming language. The potential social
impact is significant, as such models could greatly enhance self-directed
learning and provide valuable educational support to students worldwide.
However, it's important to note that the quality and usefulness of the AI models
trained on this dataset will depend on the quality of the data itself. If the
data is inaccurate or biased, the models could propagate these inaccuracies and
biases, potentially leading to misinformation or unfair outcomes.
### Discussion of Biases
The dataset is based on a variety of textbooks and online resources, which may
contain their own inherent biases. For example, textbooks often reflect the
perspectives and biases of their authors, which can influence the way
information is presented. These biases could potentially be reflected in the
dataset and in any models trained on it.
### Other Known Limitations
At this stage of the dataset creation process, it's difficult to identify all
potential limitations. However, one potential limitation is that the dataset may
not cover all possible topics or perspectives within the fields it addresses.
The dataset creator will continue to monitor and assess the dataset for
limitations as the work progresses.
## Additional Information
### Dataset Curators
The dataset was curated by an independent learner with a strong interest in
computer science. The curator has studied the subject matter in a self-directed
manner, using a variety of resources including textbooks and online materials.
The curation process also involved the use of OpenAI's GPT-3.5 and GPT-4 models
to synthesize dialogues based on the source material.
### Licensing Information
This dataset is released under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 International (CC BY-NC-SA 3.0)
license.
### Citation Information
As this dataset is a compilation of various sources synthesized and curated for
the purpose of training the phi-1 model, please ensure to cite the original
sources when using this dataset. If referencing the dataset directly, please
refer to this repository.
### 元数据
- 标题:Phi-1 模型数据集
- 发布日期:2023-07-03
- 许可证:知识共享署名-非商业性使用-相同方式共享3.0国际许可协议(Creative Commons Attribution-NonCommercial-ShareAlike 3.0,CC BY-NC-SA 3.0)
## 数据集描述
- 项目主页:[teleprint.me](https://teleprint.me)
- 代码仓库:[phi-1](https://huggingface.co/datasets/teleprint-me/phi-1)
- 相关论文:[arXiv:2306.11644v1](https://arxiv.org/abs/2306.11644v1)
- 排行榜:[排行榜链接]
- 联系方式:[aberrio@teleprint.me](aberrio@teleprint.me)
## 数据集概述
本数据集用于训练phi-1模型,基于论文《教科书皆为所需(Textbooks are All You Need)》构建。数据集包含源自各类教科书的高质量数据,通过OpenAI的大语言模型(Large Language Model)GPT-3.5和GPT-4进行转换与合成。
为获得最优训练效果,建议结合模型参数量与序列长度开展训练:
- 参数量为3.5亿的模型,建议序列长度为2048
- 参数量为7亿的模型,建议序列长度为4096
- 参数量为13亿的模型,建议序列长度为8096
请注意,本数据集目前处于规划与采集的初始阶段,整体流程涵盖数据准备、提取、格式化、分块与合成预处理。用于模型数据准备与处理的脚本尚在开发中。数据生成后,将经过审核与修订流程以确保其质量与相关性。
上述建议与说明基于数据集创建者的初始规划,可能随项目推进发生变更。
**注意**:由于本数据集的特性,未经相关出版商和/或作者许可,无法公开发布。若您是相关作品的作者或出版商,且对本仓库有任何疑问,请随时致邮联系。
若您是作者或出版商,愿意授权使用您的作品,我们将不胜感激。请注意,本数据集的发布需要所有相关方的一致许可。
在未获得此类许可的情况下,我们将尊重受版权保护材料的著作权,并为个人使用行使合理使用(Fair Use)权利。
**本数据集不得用于商业用途**,其核心用途为机器学习与人工智能软件开发研究。若基于本数据集创建模型,需采用相同许可协议进行共享。
捐赠所得款项将主要用于数据集与模型的开发工作。
## 支持任务与排行榜
- `文本生成(text-generation)`:本数据集可用于训练类聊天文本生成模型,具体可用于生成算术、代数、几何、三角学、微积分、算法与数据结构、设计模式以及Python编程语言相关场景下的解释与示例。
## 语言说明
数据集文本语言为英语。
## 数据集结构
### 数据实例
一条数据实例由用户与助手的对话组成,讨论算术、代数、几何、三角学、微积分、算法与数据结构、设计模式或Python编程语言相关主题。对话以轮次列表的形式组织,每一轮包含发言角色("user"(用户)或"assistant"(助手))与对应发言内容。
### 数据字段
- `role`:字符串类型,表示对话中发言者的角色,可选值为"system"(系统)、"user"(用户)、"assistant"(助手)或"function"(函数)。
- `content`:字符串类型,包含对话中发言者本轮的内容。
### 数据划分
本数据集划分为训练集、验证集与测试集。各划分的具体规模与比例将取决于数据集最终的总规模。
## 数据集创建
### 数据集构建依据
本数据集旨在训练能够针对各类数学与计算机科学主题生成解释与示例的模型,目标是打造一款能够针对上述主题的用户查询提供清晰、准确且符合教学逻辑的回复的AI助手。
### 源数据
#### 初始数据采集与标准化
本数据集的数据源自覆盖算术、代数、几何、三角学、微积分、算法与数据结构、设计模式以及Python编程语言的各类教科书,涉及的教科书包括:
- 《Barron's Arithmetic The Easy Way Fourth Edition》(《巴朗算术入门 第4版》)
- 《Blitzer Introductory Algebra for College Students Fifth Edition》(《布利策大学初等代数 第5版》)
- 《McDougal Littell Geometry》(《麦克道格尔·利特尔几何学》)
- 《Blitzer Intermediate Algebra for College Students 5th Edition》(《布利策大学中等代数 第5版》)
- 《Trigonometry Sixth Edition》(《三角学 第6版》)
- 《Pearson College Algebra Fourth Edition》(《培生大学代数 第4版》)
- 《Hughes-Hallet Applied Calculus 5th Edition》(《休斯-哈莱特应用微积分 第5版》)
- 《CLRS Introduction to Algorithms Third Edition》(《算法导论 第3版》)
除教科书外,数据集还包含以下在线资源的内容:
- [C语言参考文档](https://en.cppreference.com/w/c)
- [C++语言参考文档](https://en.cppreference.com/w/cpp)
- [Python标准库文档](https://docs.python.org/3/)
这些资源可为C、C++与Python编程语言提供最新的信息与示例。Cppreference网站的开发者还提供了[离线归档包](https://en.cppreference.com/w/Cppreference:Archives)供离线使用。数据集还包含由OpenAI的GPT模型合成、经数据集创建者整理的代码示例。
**注意**:本数据集创建者拥有上述所有列出教科书的实体副本。来自这些来源的数据通过OpenAI的GPT-3.5和GPT-4模型转换为对话格式,生成的对话将作为phi-1模型的训练数据。本数据集不包含源教科书的完整内容,仅包含原始内容的转换与合成结果。若希望获取完整的原始内容,请自行购买或通过合法途径获取相关教科书。
#### 源语言内容的创作者是谁?
原始语言数据由各类作者与教育工作者创作,他们是作为本数据集源材料的教科书与其他材料的创作者,具体包括:
- 《Barron's Arithmetic The Easy Way Fourth Edition》:Edward Williams、Katie Prindle
- 《Blitzer Introductory Algebra for College Students Fifth Edition》:Robert Blitzer
- 《McDougal Littell Geometry》:Ron Larson、Laurie Boswell、Timothy D. Kanold、Lee Stiff
- 《Blitzer Intermediate Algebra for College Students 5th Edition》:Robert Blitzer
- 《Trigonometry Sixth Edition》:Charles P. McKeague、Mark D. Turner
- 《Pearson College Algebra Fourth Edition》:Robert F. Blitzer
- 《Hughes-Hallet Applied Calculus 5th Edition》:Deborah Hughes-Hallett、Andrew M. Gleason、Patti Frazer Lock、Daniel E. Flath、Sheldon P. Gordon、David O. Lomen、David Lovelock、William G. McCallum、Brad G. Osgood、Andrew Pasquale、Jeff Tecosky-Feldman、Joseph Thrash、Karen R. Rhea、Thomas W. Tucker
- 《CLRS Introduction to Algorithms Third Edition》:Thomas H. Cormen、Charles E. Leiserson、Ronald L. Rivest、Clifford Stein
除上述作者外,OpenAI的GPT-3.5与GPT-4模型的开发者也为语言数据的创建做出了贡献,因为这些模型被用于将源材料转换为对话格式。
### 标注信息
#### 标注流程
本数据集未包含显式标注,但数据通过OpenAI的GPT-3.5和GPT-4模型进行整理与合成。流程涉及将源材料转换为适合训练phi-1模型的对话格式。数据集创建者作为一名对计算机科学抱有浓厚兴趣的自主学习者,对合成的对话进行了审核与整理,以确保其质量与相关性。
#### 标注者是谁?
数据集创建者,一名通过自主方式深入学习计算机科学的独立学习者,完成了合成对话的整理与审核工作。
### 个人与敏感信息
本数据集未包含任何个人或敏感信息。所有数据均源自公开可用的教科书与在线资源。源材料中的任何姓名或其他潜在标识符均已被移除或匿名化处理。
### 数据集的社会影响
本数据集旨在支持开发能够针对算术、代数、几何、三角学、微积分、算法与数据结构、设计模式以及Python编程语言相关场景生成详细解释与示例的AI模型。其潜在社会影响显著,此类模型可极大助力自主学习,并为全球学生提供宝贵的教育支持。
然而,需要注意的是,基于本数据集训练的AI模型的质量与实用性将取决于数据本身的质量。若数据存在不准确或偏见,模型可能会传播这些不准确之处与偏见,进而可能导致错误信息传播或不公平结果。
### 偏见讨论
本数据集基于各类教科书与在线资源,这些资源可能本身带有固有偏见。例如,教科书往往反映其作者的观点与偏见,这会影响信息的呈现方式。此类偏见可能会反映在数据集以及基于其训练的模型中。
### 其他已知局限性
在数据集创建的当前阶段,难以识别所有潜在局限性。但其中一个潜在局限性是,数据集可能未覆盖其所涉及领域内的所有主题与视角。数据集创建者将在项目推进过程中持续监测与评估数据集的局限性。
## 附加信息
### 数据集整理者
本数据集由一名对计算机科学抱有浓厚兴趣的独立学习者整理。该整理者通过包括教科书与在线材料在内的各类资源自主学习相关主题。整理过程还使用了OpenAI的GPT-3.5和GPT-4模型,基于源材料合成对话。
### 许可信息
本数据集采用知识共享署名-非商业性使用-相同方式共享3.0国际许可协议(Creative Commons Attribution-NonCommercial-ShareAlike 3.0,CC BY-NC-SA 3.0)进行发布。
### 引用信息
由于本数据集是为训练phi-1模型而合成与整理的各类源材料的汇编,使用本数据集时请务必引用原始源材料。若直接引用本数据集,请参照本仓库进行引用。