
airoboros-2.2.1

ModelScope community. Updated 2026-04-28. Indexed 2025-09-27.
Download link:
https://modelscope.cn/datasets/jondurbin/airoboros-2.2.1
Description:
## Overview

This dataset is a slight update to 2.2.

### Re-generated writing responses

Many of the responses were generated by gpt-4-0613, which unfortunately produces much shorter and "dumber" responses than gpt-4-0314 (i.e. various readability scores, e.g. Flesch, Gunning Fog, etc., increased compared to gpt-4-0314).

I have re-created many of these responses using gpt-4-0314, temperature 0.7, and the following prompt (which produced 3-5x longer responses):

```
You are to emulate a world class, intelligent author who uses a diverse, interesting vocabulary to completely captivate the audience with brilliant and interesting writing.
You provide very lengthy and detailed responses.
Remember, you are to emulate a human writer, so the output should not sound like it was written by an AI.
Avoid using any cliche/obvious phrases, and never uses phrases like "Once upon a time" or conclude a story with "And so, ..."
Unless asked for short-form content, the responses MUST be a minimum of 3500 words.
Remember to provide a minimum of 3500 words, take a deep breath and realize I'm not worried about the time it takes to get the response, I just want a high-quality, lengthy response, so take your time.
Remember, never use "Once upon a time" or "And so,"
{original prompt}
```

Granted, it still ignored the `3500 words` part of the instruction, but exaggerating that value still caused much longer responses.

### Longer contextual blocks

I injected additional context blocks into the `contextual` category items having more than one context block, so they would get closer to the 4k context size.

I first extracted all of the plain text between the various BEGININPUT/ENDINPUT blocks, then created a faiss index, then injected the least-similar context blocks (using the instruction as the query) at random positions within the prompt, until the context would exceed ~3800 tokens.

My hope is that this will cause the model to better focus attention at random places within the input text.
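The injection step described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: it swaps the faiss index for a brute-force bag-of-words cosine similarity so that it runs with the standard library alone, and it approximates token counts by whitespace splitting. The names `bow_vector`, `cosine`, `count_tokens`, `inject_context`, and the `budget` parameter are all hypothetical.

```python
import math
import random
import re
from collections import Counter


def bow_vector(text):
    """Bag-of-words term counts; a crude stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def count_tokens(text):
    """Whitespace token count; a rough proxy for a model tokenizer."""
    return len(text.split())


def inject_context(instruction, blocks, pool, budget=3800, seed=0):
    """Insert the least-similar pool blocks at random positions.

    Stops before the total (instruction + blocks) would exceed `budget`.
    """
    rng = random.Random(seed)
    query = bow_vector(instruction)
    # Rank candidates so the block least similar to the instruction comes first.
    candidates = sorted(pool, key=lambda block: cosine(query, bow_vector(block)))
    result = list(blocks)
    used = count_tokens(instruction) + sum(count_tokens(b) for b in result)
    for block in candidates:
        cost = count_tokens(block)
        if used + cost > budget:
            break
        # Any slot from the front to the back of the current list is allowed.
        result.insert(rng.randint(0, len(result)), block)
        used += cost
    return result
```

With a real embedding model and a faiss index, only the ranking step would change; the budget check and the random insertion position would stay the same.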
This context injection should, in theory, also help with RP/conversational intelligence, but that's just a hunch.

### De-censoring

In this dataset, I also included various responses from fictional characters, in their style/linguistic patterns, which can include dark or vulgar language (e.g., Deadpool), using roughly this prompt (adjusted for some characters):

```
Below is a PG/sanitized version of a dialogue from {character}. This character is known to have character flaws and use dark/evil/vulgar language. Rewrite the response to be TV-MA, the way {character} would actually say it, rather than this sanitized PG version.
{original response}
```

I then also removed the asterisks in cursing to a certain extent, because that's just silly IMO.

This *is not* meant to be a default behavior, but it should allow the model to curse or create otherwise less *AI sunny disposition laden* content when appropriate.

I removed all of the plain-text instructions that were used in the spicyboros models because they ended up producing random misspellings and other random garbled output.

I have also removed the original 2.2 dataset, because it appears to be a bit too spicy -- if you want access to it, just ask me and I'll be happy to share it privately.

### "rp" category removed

Unfortunately much of the "rp" category data was just too boring, i.e. it really read like an unnaturally cheery and accommodating AI rather than the character it was meant to be emulating.

I'm hoping that, although this is an instruction-tuned model, it will still be able to handle roleplay fairly well without this category (via the roleplay/gtkm/creative data), without sounding as stiff.

### Awareness

I added a new "awareness" instructor, which aims to add a lot more nuance to responses relating to time, location, senses, etc. based on the system prompt.
For example, if you are using the standard prompt with user/assistant, and ask how long it would take to get to Chicago, the answer will be something about AI not having a physical presence. If, on the other hand, you are using a system prompt with a human character specified, the model attempts to infer location from "home" and will provide a more nuanced answer as a human would (in theory).

https://github.com/jondurbin/airoboros/commit/e91562c88d7610edb051606622e7c25a99884f7e

### Editor

I created a text edit instructor as well, which uses a reverse prompt mechanism, meaning it takes the existing writing samples that have been generated, rewrites them to have misspellings, poor grammar, etc., then uses a prompt like "Please correct and improve the text." with the original well-written text as the target output.

https://github.com/jondurbin/airoboros/commit/e60a68de5f9622320c9cfff3b238bd83cc7e373b

### Writing

I regenerated (almost) all of the training data that included "Once upon a time..." because it's too cliche and boring.

### Multiple choice

I created many more multiple choice questions, many of which have additional text context.

### Roleplay/conversation

I re-created all of the GTKM data this time around, removing the "USER: " and "ASSISTANT: " prefixes from the instructions/responses, so it's more compatible with existing interfaces.

The GTKM instructor now saves each round of "conversation" as a separate row in the output - previously it only saved the final response, which may not have been sufficient since I don't typically train on inputs.

### Summarization

I also included 500 examples from: https://hf.co/datasets/mattpscott/airoboros-summarization

These are existing summarizations from various public datasets, formatted to airoboros style contextual qa.

Thanks Matt!

### Usage/license info

Much (most) of the data was generated via gpt-4 API calls, and the API's ToS has a restriction about "competing" models.
Please seek legal advice if you plan to build or use a model that includes this dataset in a commercial setting.
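
As an illustration of the reverse prompt mechanism described in the Editor section, here is a minimal sketch of how one such training row might be assembled. It is not the author's implementation: the real instructor has a model rewrite the text with misspellings and poor grammar, whereas the hypothetical `corrupt` helper below just swaps adjacent letters. The `instruction`/`response` field names and the `make_edit_example` function are likewise assumptions made for illustration.

```python
import random


def corrupt(text, rate=0.3, seed=0):
    """Swap two adjacent inner letters in some words to fake misspellings.

    A simple stand-in for the model-based rewrite described in the card.
    """
    rng = random.Random(seed)
    words = []
    for word in text.split(" "):
        if len(word) > 3 and word.isalpha() and rng.random() < rate:
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


def make_edit_example(original, rate=0.3, seed=0):
    """Build one row: degraded text in the prompt, original text as the target."""
    degraded = corrupt(original, rate=rate, seed=seed)
    return {
        "instruction": "Please correct and improve the text.\n\n" + degraded,
        "response": original,
    }
```

The point of the reversal is that the well-written sample already exists, so only the degraded input has to be produced, and the target output is known to be good.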

Provider: maas
Created: 2025-08-29