
teleprint-me/phi-1 | Education Dataset | Natural Language Processing Dataset

hugging_face | Updated 2023-07-08 | Indexed 2024-03-04
Education
Natural Language Processing
Download link:
https://hf-mirror.com/datasets/teleprint-me/phi-1
Description:
---
title: 'Phi-1 Model Dataset'
date: '2023-07-03'
license: cc-by-nc-sa-3.0
---

## Dataset Description

- **Homepage:** [teleprint.me](https://teleprint.me)
- **Repository:** [phi-1](https://huggingface.co/datasets/teleprint-me/phi-1)
- **Paper:** [2306.11644v1](https://arxiv.org/abs/2306.11644v1)
- **Leaderboard:** [Link to the leaderboard]
- **Point of Contact:** [aberrio@teleprint.me](mailto:aberrio@teleprint.me)

### Dataset Summary

This dataset is created for training the phi-1 model, based on the paper "Textbooks Are All You Need". It contains high-quality data derived from various textbooks, transformed and synthesized using OpenAI's GPT-3.5 and GPT-4 models.

For optimal results, it is recommended to train models with the following parameter counts and sequence lengths:

- For a model with 350M parameters, use a sequence length of 2048.
- For a model with 700M parameters, use a sequence length of 4096.
- For a model with 1.3B parameters, use a sequence length of 8096.

Please note that the dataset is currently in its initial phase of planning and collection. The process involves preparing the data, extracting it, formatting it, chunking it, and preparing it for synthesis. Scripts for preparing and processing the data for the model will be developed. Once the data is generated, it will undergo a review and revision process to ensure its quality and relevance. These recommendations and notes are based on the dataset creator's initial plans and may change as the project progresses.

**NOTE**: Due to the nature of this dataset, it cannot be released without obtaining permissions from the respective publishers and/or authors. If you are an author or publisher and have any concerns about this repository, please feel free to email me. If you are an author or publisher and would like to grant permission for the use of your work, your support would be greatly appreciated.

Please note that in order for the dataset to be released, permissions would need to be unanimous from all involved parties. In the absence of such permissions, I will respect the copyrights of the copyrighted materials and exercise my right to Fair Use with my own physical property for personal use.

**This dataset is NOT intended for commercial purposes**. Its primary purpose is research in machine learning and AI software development. If a model is created using this dataset, it will be shared under the same license. Any proceeds derived from donations will be used primarily for development of the dataset and the model.

### Supported Tasks and Leaderboards

- `text-generation`: The dataset can be used to train a model for chat-like text generation, more specifically for generating explanations and examples in the context of arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, and the Python programming language.

### Languages

The text in the dataset is in English.

## Dataset Structure

### Data Instances

A data instance consists of a dialogue between a user and an assistant, discussing a topic in arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, or the Python programming language. The dialogue is structured as a list of turns, each turn containing the role ("user" or "assistant") and the content of the turn.

### Data Fields

- `role`: a string indicating the role of the speaker in the dialogue ("system", "user", "assistant", "function").
- `content`: a string containing the content of the speaker's turn in the dialogue.

### Data Splits

The dataset is split into a training set, a validation set, and a test set. The exact sizes and proportions of these splits will depend on the final size of the dataset.

## Dataset Creation

### Curation Rationale

The dataset is being created to train a model capable of generating explanations and examples across a range of mathematical and computer science topics. The goal is to create an AI assistant that can provide clear, accurate, and pedagogically sound responses to user queries on these topics.

### Source Data

#### Initial Data Collection and Normalization

The data is collected from a variety of textbooks covering arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, and the Python programming language. The textbooks used include:

- Barron's Arithmetic The Easy Way Fourth Edition
- Blitzer Introductory Algebra for College Students Fifth Edition
- McDougal Littell Geometry
- Blitzer Intermediate Algebra for College Students 5th Edition
- Trigonometry Sixth Edition
- Pearson College Algebra Fourth Edition
- Hughes-Hallett Applied Calculus 5th Edition
- CLRS Introduction to Algorithms Third Edition

In addition to the textbooks, the dataset also includes material from the following online resources:

- [C reference](https://en.cppreference.com/w/c)
- [Cpp reference](https://en.cppreference.com/w/cpp)
- [Python Standard Library](https://docs.python.org/3/)

These resources provide up-to-date information and examples for the C, C++, and Python programming languages. The creators of the cppreference site also provide [archives](https://en.cppreference.com/w/Cppreference:Archives) of their site for offline use.

Code samples synthesized by OpenAI's GPT models, curated by the dataset creator, are also included in the dataset.

**Note:** The creator of this dataset owns physical copies of all the textbooks listed above.

The data from these sources is transformed into a dialogue format using OpenAI's GPT-3.5 and GPT-4 models. The resulting dialogues are then used as training data for the phi-1 model. This dataset does not include the full content of the source textbooks; instead, it consists of transformations and syntheses of the original content. Anyone who wants access to the full original content should purchase or otherwise legally access the textbooks themselves.

#### Who are the source language producers?

The original language data was created by a variety of authors and educators who wrote the textbooks and other materials used as sources for this dataset. These include:

- Barron's Arithmetic The Easy Way Fourth Edition - Edward Williams, Katie Prindle
- Blitzer Introductory Algebra for College Students Fifth Edition - Robert Blitzer
- McDougal Littell Geometry - Ron Larson, Laurie Boswell, Timothy D. Kanold, Lee Stiff
- Blitzer Intermediate Algebra for College Students 5th Edition - Robert Blitzer
- Trigonometry Sixth Edition - Charles P. McKeague, Mark D. Turner
- Pearson College Algebra Fourth Edition - Robert F. Blitzer
- Hughes-Hallett Applied Calculus 5th Edition - Deborah Hughes-Hallett, Andrew M. Gleason, Patti Frazer Lock, Daniel E. Flath, Sheldon P. Gordon, David O. Lomen, David Lovelock, William G. McCallum, Brad G. Osgood, Andrew Pasquale, Jeff Tecosky-Feldman, Joseph Thrash, Karen R. Rhea, Thomas W. Tucker
- CLRS Introduction to Algorithms Third Edition - Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein

In addition to these authors, the developers of OpenAI's GPT-3.5 and GPT-4 models also contributed to the creation of the language data, as these models were used to transform the source material into a dialogue format.

### Annotations

#### Annotation process

The dataset does not contain any explicit annotations. However, the data is curated and synthesized using OpenAI's GPT-3.5 and GPT-4 models. The process involves transforming the source material into a dialogue format suitable for training the phi-1 model. The dataset creator, an independent learner with a strong interest in computer science, reviewed and curated the synthesized dialogues to ensure their quality and relevance.

#### Who are the annotators?

The dataset creator, an independent learner who has studied computer science extensively in a self-directed manner, performed the curation and review of the synthesized dialogues.

### Personal and Sensitive Information

The dataset does not contain any personal or sensitive information. All the data is derived from publicly available textbooks and online resources. Any names or other potential identifiers in the source material have been removed or anonymized.

### Social Impact of Dataset

The dataset is intended to support the development of AI models capable of providing detailed explanations and examples in the context of arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, and the Python programming language. The potential social impact is significant, as such models could greatly enhance self-directed learning and provide valuable educational support to students worldwide. However, the quality and usefulness of AI models trained on this dataset will depend on the quality of the data itself: if the data is inaccurate or biased, the models could propagate these inaccuracies and biases, potentially leading to misinformation or unfair outcomes.

### Discussion of Biases

The dataset is based on a variety of textbooks and online resources, which may contain their own inherent biases. For example, textbooks often reflect the perspectives and biases of their authors, which can influence the way information is presented. These biases could potentially be reflected in the dataset and in any models trained on it.

### Other Known Limitations

At this stage of the dataset creation process, it is difficult to identify all potential limitations. One potential limitation, however, is that the dataset may not cover all possible topics or perspectives within the fields it addresses. The dataset creator will continue to monitor and assess the dataset for limitations as the work progresses.

## Additional Information

### Dataset Curators

The dataset was curated by an independent learner with a strong interest in computer science. The curator has studied the subject matter in a self-directed manner, using a variety of resources including textbooks and online materials. The curation process also involved the use of OpenAI's GPT-3.5 and GPT-4 models to synthesize dialogues based on the source material.

### Licensing Information

This dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.

### Citation Information

As this dataset is a compilation of various sources synthesized and curated for the purpose of training the phi-1 model, please ensure to cite the original sources when using this dataset. If referencing the dataset directly, please refer to this repository.
Provided by:
teleprint-me
Original Information Summary

Phi-1 Model Dataset

Dataset Description

Dataset Summary

This dataset was created to train the phi-1 model, based on the paper "Textbooks Are All You Need". It contains high-quality data derived from various textbooks, transformed and synthesized using OpenAI's GPT-3.5 and GPT-4 models.

Recommended Parameters and Sequence Lengths

  • For a model with 350M parameters, use a sequence length of 2048.
  • For a model with 700M parameters, use a sequence length of 4096.
  • For a model with 1.3B parameters, use a sequence length of 8096.
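The sequence lengths above determine how prepared text would be carved into training sequences. A minimal sketch of that chunking step, where the whitespace "tokenizer" and the `chunk_tokens` helper are stand-ins for a real pipeline:

```python
# Sketch: split a token stream into fixed-length training sequences.
# chunk_tokens is a hypothetical helper; a real pipeline would use the
# model's own tokenizer rather than str.split().

def chunk_tokens(tokens, seq_len):
    """Return consecutive chunks of at most seq_len tokens each."""
    return [tokens[i:i + seq_len] for i in range(0, len(tokens), seq_len)]

text = "one two three four five six seven"
chunks = chunk_tokens(text.split(), seq_len=3)
print(chunks)  # three chunks: 3 + 3 + 1 tokens
```

In practice `seq_len` would be 2048, 4096, or 8096 to match the model sizes listed above.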

Dataset Status

The dataset is currently in the initial phase of planning and collection. Scripts will be developed for preparing, extracting, formatting, chunking, and readying the data for synthesis. Once the data is generated, it will be reviewed and revised to ensure its quality and relevance.

Usage Restrictions

The dataset may not be released without permission from the respective publishers and/or authors. Its primary purpose is research in machine learning and AI software development.

Supported Tasks and Leaderboards

  • text-generation: The dataset can be used to train a model for chat-like text generation, specifically for generating explanations and examples in the context of arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, and the Python programming language.

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

A data instance consists of a dialogue between a user and an assistant discussing a topic in arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, or the Python programming language. The dialogue is structured as a list of turns, each containing a role ("user" or "assistant") and the content of that turn.

Data Fields

  • role: a string indicating the speaker's role in the dialogue ("system", "user", "assistant", "function").
  • content: a string containing the content of the speaker's turn.
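To make the role/content schema concrete, here is a hypothetical instance in that format along with a small structural check; the dialogue text and the `validate` helper are invented for illustration and are not taken from the dataset itself:

```python
# A hypothetical data instance in the role/content dialogue format,
# plus a small structural check. The dialogue text is illustrative;
# actual records may differ once the dataset is released.

VALID_ROLES = {"system", "user", "assistant", "function"}

dialogue = [
    {"role": "system", "content": "You are a patient algebra tutor."},
    {"role": "user", "content": "How do I solve 2x + 3 = 11?"},
    {"role": "assistant", "content": "Subtract 3 from both sides, then divide by 2: x = 4."},
]

def validate(turns):
    """Every turn must have a known role and non-empty string content."""
    return all(
        turn.get("role") in VALID_ROLES
        and isinstance(turn.get("content"), str)
        and turn["content"]
        for turn in turns
    )

print(validate(dialogue))  # True
```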

Data Splits

The dataset is split into training, validation, and test sets. The exact sizes and proportions of these splits will depend on the final size of the dataset.
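Since the proportions are not yet fixed, the following is only one conventional way such splits could be carved, assuming an 80/10/10 ratio and a seeded shuffle for reproducibility (both assumptions, not part of the card):

```python
# Sketch: shuffle examples deterministically and carve train/val/test splits.
# The 80/10/10 ratio is an assumption; the card does not fix the proportions.
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    return (examples[n_test + n_val:],        # training set
            examples[n_test:n_test + n_val],  # validation set
            examples[:n_test])                # test set

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```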

Dataset Creation

Curation Rationale

The dataset is intended to train a model capable of generating explanations and examples across a range of mathematics and computer science topics. The goal is an AI assistant that provides clear, accurate, and pedagogically sound responses to user queries on these topics.

Source Data

Initial Data Collection and Normalization

The data is collected from a variety of textbooks covering arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, and the Python programming language. The textbooks used include:

  • Barron's Arithmetic The Easy Way Fourth Edition
  • Blitzer Introductory Algebra for College Students Fifth Edition
  • McDougal Littell Geometry
  • Blitzer Intermediate Algebra for College Students 5th Edition
  • Trigonometry Sixth Edition
  • Pearson College Algebra Fourth Edition
  • Hughes-Hallett Applied Calculus 5th Edition
  • CLRS Introduction to Algorithms Third Edition

In addition to the textbooks, the dataset also includes material from the following online resources:

  • C reference: https://en.cppreference.com/w/c
  • Cpp reference: https://en.cppreference.com/w/cpp
  • Python Standard Library: https://docs.python.org/3/

These resources provide up-to-date information and examples for the C, C++, and Python programming languages. The dataset also includes code samples synthesized by OpenAI's GPT models and curated by the dataset creator.

Source Language Producers

The original language data was created by the authors and educators who wrote the textbooks and other materials used as sources for this dataset. These include:

  • Barron's Arithmetic The Easy Way Fourth Edition - Edward Williams, Katie Prindle
  • Blitzer Introductory Algebra for College Students Fifth Edition - Robert Blitzer
  • McDougal Littell Geometry - Ron Larson, Laurie Boswell, Timothy D. Kanold, Lee Stiff
  • Blitzer Intermediate Algebra for College Students 5th Edition - Robert Blitzer
  • Trigonometry Sixth Edition - Charles P. McKeague, Mark D. Turner
  • Pearson College Algebra Fourth Edition - Robert F. Blitzer
  • Hughes-Hallett Applied Calculus 5th Edition - Deborah Hughes-Hallett, Andrew M. Gleason, Patti Frazer Lock, Daniel E. Flath, Sheldon P. Gordon, David O. Lomen, David Lovelock, William G. McCallum, Brad G. Osgood, Andrew Pasquale, Jeff Tecosky-Feldman, Joseph Thrash, Karen R. Rhea, Thomas W. Tucker
  • CLRS Introduction to Algorithms Third Edition - Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein

Annotations

Annotation Process

The dataset does not contain explicit annotations. However, the data is curated and synthesized using OpenAI's GPT-3.5 and GPT-4 models, a process that transforms the source material into a dialogue format suitable for training the phi-1 model. The dataset creator, an independent learner with a strong interest in computer science, reviewed and curated the synthesized dialogues to ensure their quality and relevance.

Annotators

The dataset creator, an independent learner who has studied computer science extensively in a self-directed manner, performed the curation and review of the synthesized dialogues.

Personal and Sensitive Information

The dataset does not contain any personal or sensitive information. All data is derived from publicly available textbooks and online resources, and any names or other potential identifiers in the source material have been removed or anonymized.

Social Impact of the Dataset

The dataset is intended to support the development of AI models that can provide detailed explanations and examples in arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, and the Python programming language. The potential social impact is significant, as such models could greatly enhance self-directed learning and provide valuable educational support to students worldwide.

Discussion of Biases

The dataset is based on a variety of textbooks and online resources, which may carry inherent biases of their own. Textbooks, for example, often reflect the perspectives and biases of their authors, which can shape how information is presented. These biases could be reflected in the dataset and in any models trained on it.

Other Known Limitations

At this stage of the dataset creation process, it is difficult to identify all potential limitations. One known limitation is that the dataset may not cover every possible topic or perspective within the fields it addresses. The dataset creator will continue to monitor and assess the dataset for limitations as the work progresses.

Additional Information

Dataset Curators

The dataset was curated by an independent learner with a strong interest in computer science, who has studied the subject matter in a self-directed manner using a variety of resources, including textbooks and online materials. The curation process also involved using OpenAI's GPT-3.5 and GPT-4 models to synthesize dialogues from the source material.

Licensing Information

The dataset is released under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) license.

Citation Information

As this dataset is a compilation of various sources synthesized and curated for training the phi-1 model, please cite the original sources when using it. If referencing the dataset directly, please refer to this repository.

AI-Generated Summary
Dataset Introduction
Construction
The dataset is built by transforming and synthesizing mathematics and computer science textbook content with OpenAI's GPT-3.5 and GPT-4 models into a high-quality dialogue format suited to training the phi-1 model. Data is extracted from selected textbooks and online resources, then formatted, chunked, and synthesized into dialogue instances used for model training.
Characteristics
The dataset draws on a diverse set of textbooks and online resources covering arithmetic, algebra, geometry, trigonometry, calculus, algorithms and data structures, design patterns, and the Python programming language. It is presented as dialogues between a user and an assistant, with the aim of training AI models that generate clear, accurate, and pedagogically sound text. Copyright protection and fair-use principles were emphasized during its construction.
Usage
Users of the dataset must follow its published license, which limits use to research in machine learning and AI software development. Training, validation, and test splits are provided, and users can choose training data appropriate to their model's parameter count and sequence length. Copyright law must be observed, and commercial use is not permitted.
Background and Challenges
Background
The Phi-1 Model Dataset was built to train the phi-1 model, with its theoretical basis in the paper "Textbooks Are All You Need". It gathers high-quality data from a range of textbooks, transformed and synthesized with OpenAI's GPT-3.5 and GPT-4 models. As of July 3, 2023, the dataset was still in the early planning and collection stage. Its goal is an AI assistant that offers clear, accurate, and pedagogically sound explanations and examples across many mathematics and computer science topics. Release of the dataset requires permission from the relevant publishers and/or authors, and it is currently intended for research in machine learning and AI software development.
Current Challenges
The main challenges include securing unanimous permission from all relevant publishers and authors, which is essential for release; possible limits to the breadth and depth of the dataset's coverage of mathematics and computer science; and the risk that biases inherited from textbooks and other sources will affect the trained models. Quality control and accuracy checking during construction are also key to model performance.
Common Scenarios
Classic Use Cases
In AI-assisted education, the Phi-1 model dataset plays a particularly important role. Built from textbook content transformed and synthesized by deep learning models, its classic use case is training AI models that generate answers and worked examples for mathematics and computer science questions, providing learners with real-time, accurate academic tutoring.
Practical Applications
In practice, the Phi-1 dataset can be used to build intelligent teaching assistants that answer questions and provide instructional support in mathematics and computer science. Such applications can improve learning efficiency and bring high-quality educational support to regions short on teaching resources.
Related Work
Work building on the Phi-1 dataset includes, among other things, more capable question-answering systems, improved ways of presenting instructional content, and new approaches to AI-based educational assessment. This work advances educational technology and offers new perspectives and tools for personalized, intelligent learning.
The above content was collected and summarized by AI.
User Comments
Are there any related papers or references?
What is the background behind this dataset's creation?
Who is the author of this dataset?
Can you help me contact the author of this dataset?
How can this dataset be downloaded?