jondurbin/airoboros-gpt4-1.1
Hugging Face · Updated 2023-06-22 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/jondurbin/airoboros-gpt4-1.1
Description:
---
license: cc-by-nc-4.0
---
The data was generated by gpt-4 and is therefore subject to the OpenAI ToS. The tool used to generate the data, [airoboros](https://github.com/jondurbin/airoboros), is licensed under Apache 2.0.
Specific areas of focus for this training data:
* trivia
* math
* nonsensical math
* coding
* closed context question answering
* closed context question answering, with multiple contexts to choose from as confounding factors
* writing
* multiple choice
This largely overlaps with the original [dataset](https://huggingface.co/datasets/jondurbin/airoboros-gpt4), but with a few extras:
* fixed contextual entries that were missing closing tags (e.g. "ENDINPUT", "ENDINSTRUCTION", etc.)
* fixed an issue where source information was provided even when not asked for (the model always tried to provide source info)
* added some questions that were unrelated to the provided context, to train the model to say when it can't provide an answer
* added several new contextual instructions, including some in FAQ style, to hopefully prevent questions in the context from breaking inference
* hundreds more coding samples, focusing primarily on python, java, javascript, c/c++, and golang
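
The closing-tag fix mentioned above can be illustrated with a small check. This is a minimal sketch, not the tooling actually used to repair the dataset: only the closing names "ENDINPUT" and "ENDINSTRUCTION" appear in this card, so the opening tag names `BEGININPUT` and `BEGININSTRUCTION`, the `TAG_PAIRS` table, and the helper `missing_closing_tags` are all assumptions made for illustration.

```python
import re

# Assumed tag pairs: only the closing names "ENDINPUT" and "ENDINSTRUCTION"
# are given in the dataset card; the opening names are a guess.
TAG_PAIRS = [
    ("BEGININPUT", "ENDINPUT"),
    ("BEGININSTRUCTION", "ENDINSTRUCTION"),
]


def missing_closing_tags(prompt: str) -> list[str]:
    """Return each closing tag that occurs fewer times than its opening tag."""
    missing = []
    for opening, closing in TAG_PAIRS:
        opens = len(re.findall(rf"\b{opening}\b", prompt))
        closes = len(re.findall(rf"\b{closing}\b", prompt))
        if opens > closes:
            missing.append(closing)
    return missing


if __name__ == "__main__":
    # Hypothetical prompt whose input block was never closed.
    broken = "BEGININPUT\nsome context\nBEGININSTRUCTION\nA question?\nENDINSTRUCTION"
    print(missing_closing_tags(broken))
```

A check like this only counts tags; it does not verify nesting order, which is enough to flag entries of the kind described in the list above.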
### Usage and License Notices
All airoboros models and datasets are intended and licensed for research use only. I've used the 'cc-by-nc-4.0' license, but really it is subject to a custom/special license because:
- the base model is LLaMA, which has its own special research license
- the dataset(s) were generated with OpenAI (gpt-4 and/or gpt-3.5-turbo), which has a clause saying the data can't be used to create models that compete with OpenAI
So, to reiterate: this model (and datasets) cannot be used commercially.
Provided by:
jondurbin
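Summary of original information
Dataset overview
Data source
- The data was generated by gpt-4 and is subject to the OpenAI terms of service.
- The data generation tool is airoboros, which is licensed under Apache 2.0.
Data content
- Main areas of focus:
  - trivia
  - math
  - nonsensical math
  - coding
  - closed context question answering
  - closed context question answering, with multiple contexts as confounding factors
  - writing
  - multiple choice
Dataset characteristics
- Largely overlaps with the original dataset jondurbin/airoboros-gpt4, with the following additions:
  - Fixed contextual entries that were missing closing tags (e.g. "ENDINPUT", "ENDINSTRUCTION").
  - Fixed an issue where the model provided source information even when not asked.
  - Added some questions unrelated to the provided context, to train the model to recognize when it cannot provide an answer.
  - Added new contextual instructions, including FAQ-style ones, to prevent questions in the context from breaking inference.
  - Added hundreds of coding samples, focusing primarily on Python, Java, JavaScript, C/C++, and Golang.
Usage license
- The datasets and models are for research use only, under the cc-by-nc-4.0 license.
- Because the base model LLaMA has a special research license, and the data was generated with OpenAI's gpt-4 and/or gpt-3.5-turbo, the dataset is effectively subject to special license terms that prohibit commercial use.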



