airoboros-3.2
收藏魔搭社区2026-01-09 更新2025-02-15 收录
下载链接:
https://modelscope.cn/datasets/jondurbin/airoboros-3.2
下载链接
链接失效反馈官方服务:
资源简介:
## Overview
This dataset is a continuation of the [airoboros-3.1](https://hf.co/datasets/jondurbin/airoboros-3.1) with the following changes:
* MathJSON has been removed for the time-being, because it seems to confuse the models at times, causing more problems than it's worth. The mathjson dataset can be found [here](https://huggingface.co/datasets/jondurbin/mathjson-alpha)
* The de-censorship data has been re-added, to ensure a non-DPO SFT model using this dataset is relatively uncensored.
* ~11k instructions from [slimorca](https://huggingface.co/datasets/Open-Orca/SlimOrca) where extended to have an additional, follow-up turn to enhance multi-turn capabilities.
## Format
The format is now in ShareGPT format, to better accomodate the OS ecosystem fine-tuning tooling.
## Usage restriction
To use this data, you must acknowledge/agree to the following:
- a small sampling of the data contained within is "toxic"/"harmful", and contains profanity and other types of sensitive content
- none of the content or views contained in the dataset necessarily align with my personal beliefs or opinions, they are simply text generated by LLMs without a great amount of validation
- you are able to use the dataset lawfully, particularly in locations with less-than-free speech laws
- you, and you alone are responsible for having downloaded and used the dataset, and I am completely indemnified from any and all liabilities
Also note that the data was generated primarily with gpt-4, and therefore may have some strings attached to the OpenAI terms of service.
## 概述
本数据集为[airoboros-3.1](https://hf.co/datasets/jondurbin/airoboros-3.1)的后续迭代版本,更新内容如下:
* 暂时移除了MathJSON模块,因其时常会对模型造成干扰,弊大于利。MathJSON数据集可于[此处](https://huggingface.co/datasets/jondurbin/mathjson-alpha)获取。
* 重新加入了去内容审查(de-censorship)数据,以确保使用该数据集训练的非直接偏好优化(DPO)监督微调(SFT)模型具备相对无审查的输出能力。
* 从[slimorca](https://huggingface.co/datasets/Open-Orca/SlimOrca)中选取约1.1万条指令,并为其新增一轮后续对话,以增强模型的多轮对话能力。
## 格式说明
本数据集现已采用ShareGPT格式,以更好适配开源生态下的微调工具链。
## 使用限制
使用本数据集前,您需同意以下条款:
- 本数据集包含少量“有毒”/“有害”内容,涉及亵渎性语言及其他敏感信息;
- 数据集所载内容或观点均不代表笔者个人立场,其仅为大语言模型(LLM)生成的未经充分校验的文本;
- 您需确保在符合当地法律法规的前提下使用本数据集,尤其在言论自由受限的地区;
- 数据集的下载与使用责任由您本人承担,笔者不对任何相关法律责任承担连带责任。
另需注意,本数据集主要由GPT-4生成,因此需遵守OpenAI服务条款的相关约束。
提供机构:
maas
创建时间:
2025-08-29



