
airoboros-gpt4-2.0

ModelScope community · Updated 2026-05-11 · Indexed 2025-09-13
Download link:
https://modelscope.cn/datasets/jondurbin/airoboros-gpt4-2.0
Description:
## Overview

This is a brand new dataset, with nothing copied from the 1.* series of airoboros, using only the June version of gpt-4.

I used the latest overhaul of the airoboros python tool to generate the data, which has several "instructors", where an instructor is a specific prompt/response generator.

The instructors include:

- agent/function style prompts, which generate a function name and args based on the provided input and available functions in either JSON or YAML format
- model/scenario/character cards, to help build random descriptive cards based on a template
- coding and scripting
- contextual q&a with the specific context-obedient formatting
- chain-of-thought, i.e. for a given question, generate ~3 possible solutions, rank them, select the best
- experience, e.g. guided meditations or describing a walk through a forest
- general: completely random tasks not specifically targeting any type of task, using a random list of topics
- jokes (still horrible, but at least there are some now)
- orca, i.e. "Solve [problem], provide step-by-step reasoning."
- execution planning, specifically the reWOO style, where you describe a list of available functions and it will generate a plan to make use of them
- riddles (still not great either, but present)
- roleplay
- songs
- wordgames, e.g. give me a list of 28 words that start with 'cr'
- creative writing

**Is it better than 1.4?**

Not necessarily. It has some extra functionality that didn't exist before, but if you want to be sure you don't lose much, check out m2.0, which is a merge of 1.4.1 and 2.0:

https://huggingface.co/datasets/jondurbin/airoboros-gpt4-m2.0

The main point here was to test the June version of gpt-4 against the March version (and add new prompt types).

### Category breakdown

![chart](breakdown.png)

### Configuration for airoboros

https://gist.github.com/jondurbin/65df002c16560899e05365ca6cbd43e3

### Licence and usage restrictions

The data was generated by gpt-4 via OpenAI API calls.

The ToS for OpenAI API usage has a clause preventing the output from being used to train a model that __competes__ with OpenAI:

- what does *compete* actually mean here?
- these small open source models will not produce output anywhere near the quality of gpt-4, or even gpt-3.5, so I can't imagine this could credibly be considered competing in the first place
- if someone else uses the dataset to do the same, they wouldn't necessarily be violating the ToS because they didn't call the API, so I don't know how that works
- the training data used in essentially all large language models includes a significant amount of copyrighted or otherwise unallowably licensed material in the first place
- other work using the self-instruct method, e.g. the original here: https://github.com/yizhongw/self-instruct, released the data and model as apache-2

I am purposely leaving this license ambiguous (other than the fact that you must comply with the original Meta license) because I am not a lawyer and refuse to attempt to interpret all of the terms accordingly.

Your best bet is probably to avoid using this commercially due to the OpenAI API usage.

Either way, by using this dataset, you agree to completely indemnify me from any and all license related issues.

Attribution would be nice if you use some or all of the data.
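For readers who want to recompute a category breakdown like the one charted in this card from a local copy of the data, a minimal sketch follows. It assumes the records are stored as JSON Lines and that each record carries a `category` field. Neither the file layout nor the field name is stated in this card, so treat both as hypothetical and adjust them to the actual files.

```python
import json
from collections import Counter
from typing import Iterable


def category_breakdown(lines: Iterable[str], field: str = "category") -> Counter:
    """Count records per category from JSON Lines text.

    `field` is an assumed key name; records missing it are
    counted under "unknown".
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record.get(field, "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records, for illustration only (not taken from the dataset).
    sample = [
        '{"instruction": "Tell me a joke.", "response": "...", "category": "joke"}',
        '{"instruction": "Solve 2 + 2.", "response": "...", "category": "orca"}',
        '{"instruction": "Another joke.", "response": "...", "category": "joke"}',
    ]
    for name, n in category_breakdown(sample).most_common():
        print(f"{name}: {n}")
```

To run it against a real file, pass an open file handle, e.g. `category_breakdown(open(path, encoding="utf-8"))`, where `path` points at whichever JSON Lines file you downloaded.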

Provider:
maas
Created:
2025-08-29