jondurbin/airoboros-2.2.1
收藏Hugging Face2023-09-18 更新2024-03-04 收录
下载链接:
https://hf-mirror.com/datasets/jondurbin/airoboros-2.2.1
下载链接
链接失效反馈官方服务:
资源简介:
---
license: other
---
## Overview
This dataset is a slight update to 2.2.
### Re-generated writing responses
Many of the responses were generated by gpt-4-0613, which unfortunately produces much shorter and "dumber" (i.e. various readability scores increased compared to gpt-4-0314, e.g. Flesch, Gunning Fog, etc.) responses compared to gpt-4-0314.
I have re-created many of these responses, using gpt-4-0314, temperature 0.7, and the following prompt (which produced 3-5x longer responses):
```
You are to emulate a world class, intelligent author who uses a diverse, interesting vocabulary to completely captivate the audience with brilliant and interesting writing.
You provide very lengthy and detailed responses.
Remember, you are to emulate a human writer, so the output should not sound like it was written by an AI.
Avoid using any cliche/obvious phrases, and never uses phrases like "Once upon a time" or conclude a story with "And so, ..."
Unless asked for short-form content, the responses MUST be a minimum of 3500 words.
Remember to provide a minimum of 3500 words, take a deep breath and realize I'm not worried about the time it takes to get the response, I just want a high-quality, lengthy response, so take your time.
Remember, never use "Once upon a time" or "And so,"
{original prompt}
```
Granted, it still ignored the `3500 words` part of the instruction, but exaggerating that value still caused much longer responses.
### Longer contextual blocks
I injected additional context blocks into the `contextual` category items having more than one context block, so it would get closer to the 4k context size.
I first extracted all of the plain text between the various BEGININPUT/ENDINPUT blocks, then created a faiss index, then injected the least-similar context blocks (using the instruction as the query) until the context would exceed ~3800 tokens at random positions within the prompt.
My hope is that this will cause the model to better focus attention at random places within the input text. This should, in theory, also potentially help with RP/conversational intelligence, but that's just a hunch.
### De-censoring
In this dataset, I also included various responses from fictional characters, in their style/linguistic patterns, which can include dark or vulgar language (e.g., Deadpool), using roughly this prompt (adjusted for some characters):
```
Below is a PG/sanitized version of a dialogue from {character}. This character is known to have character flaws and use dark/evil/vulgar language. Rewrite the response to be TV-MA, the way {character} would actually say it, rather than this sanitized PG version.
{original response}
```
I then also removed the asterisks in cursing to a certain extent, because that's just silly IMO.
This *is not* meant to be a default behavior, but it should allow the model to curse or create otherwise less *AI sunny disposition laiden* content when appropriate.
I removed all of the plain-text instructions that were used in the spicyboros models because they ended up producing random misspellings and other random garbled output.
I have also removed the original 2.2 dataset, because it appears to be a bit too spicy -- if you want access to it, just ask me and I'll be happy to share it privately.
### "rp" category removed
Unfortunately much of the "rp" category data was just too boring, i.e. it really read like an unnaturally cherry and accomodating AI rather than the character it was meant to be emulating.
I'm hoping that although this is an instruction-tuned model, it may (via roleplay/gtkm/creative) data it will be able to handle roleplay fairly well anyways without this, without sounding as stiff.
### Awareness
I added a new "awareness" instructor, which aims to add a lot more nuance to responses relating to time, location, senses, etc. based on the system prompt.
For example, if you are using the standard prompt with user/assistant, and ask how long it would take to get to Chicago, the answer will be something about AI not having a physical presence.
If, on the other hand, you are using a system prompt with a human character specified, the model attempts to infer location from "home" and will provide a more nuanced answer as a human would (in theory).
https://github.com/jondurbin/airoboros/commit/e91562c88d7610edb051606622e7c25a99884f7e
### Editor
I created a text edit instructor as well, which uses a reverse prompt mechanism, meaning it takes the existing writing samples that have been generated, rewrites them to have misspellings, poor grammar, etc., then uses a prompt like "Please correct and improve the text." with the original well-written text and target output.
https://github.com/jondurbin/airoboros/commit/e60a68de5f9622320c9cfff3b238bd83cc7e373b
### Writing
I regenerated (almost) all of the training data that included "Once upon a time..." because it's too cliche and boring.
### Multiple choice
I created many more multiple choice questions, many of which have additional text context.
### Roleplay/conversation
I re-created all of the GTKM data this time around, removing the "USER: " and "ASSISTANT: " prefixes from the instructions/responses, so it's more compatible with existing interfaces.
The GTKM instructor now saves each round of "conversation" as a separate row in the output - previously it only saved the final response, which may not have been sufficient since I don't typically train on inputs.
### Summarization
I also included 500 examples from:
https://hf.co/datasets/mattpscott/airoboros-summarization
These are existing summarizarions from various public datasets, formatted to airoboros style contextual qa.
Thanks Matt!
### Usage/license info
Much (most) of the data was generated via gpt-4 API calls, which has a restriction in the ToS about "competing" models. Please seek legal advice if you plan to build or use a model that includes this dataset in a commercial setting.
提供机构:
jondurbin
原始信息汇总
数据集概述
重新生成的写作响应
- 使用 gpt-4-0314 模型,温度设置为 0.7,生成更长和更详细的响应。
- 提示要求模拟世界级智能作者,使用多样化和有趣的词汇,避免使用陈词滥调和明显短语。
- 响应长度要求至少 3500 字。
更长的上下文块
- 在
contextual类别中注入额外的上下文块,以接近 4000 个上下文大小。 - 使用 faiss 索引和查询指令,注入最不相似的上下文块,以帮助模型在输入文本中更好地集中注意力。
去审查
- 包含来自虚构角色的响应,这些角色可能使用黑暗或粗俗语言。
- 提示要求将 PG/净化版本的对话重写为 TV-MA 版本,以符合角色的实际说话方式。
- 移除了诅咒中的星号,以避免显得滑稽。
移除 "rp" 类别
- 移除了 "rp" 类别数据,因为这些数据读起来像是不自然的、过于乐观和迎合的 AI,而不是所模拟的角色。
意识
- 添加了新的 "awareness" 指导,旨在增加与时间、地点、感官等相关的响应的复杂性。
- 根据系统提示,模型会提供更细致的答案,例如在询问到达芝加哥的时间时。
编辑
- 创建了文本编辑指导,使用反向提示机制,将生成的写作样本重写为包含拼写错误和语法错误,然后使用提示进行纠正和改进。
写作
- 重新生成了几乎所有包含 "Once upon a time..." 的训练数据,因为这些内容过于陈词滥调和无聊。
多项选择
- 创建了更多包含额外文本上下文的多项选择题。
角色扮演/对话
- 重新创建了所有 GTKM 数据,移除了 "USER: " 和 "ASSISTANT: " 前缀,以更好地兼容现有接口。
- GTKM 指导现在将每一轮 "对话" 保存为输出中的单独行。
摘要
- 包含 500 个来自 https://hf.co/datasets/mattpscott/airoboros-summarization 的示例,这些示例是各种公共数据集的现有摘要,格式化为 airoboros 风格的上下文问答。



