Azure99/blossom-v6.3-sft-stage2
收藏Hugging Face2025-12-06 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/Azure99/blossom-v6.3-sft-stage2
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
language:
- zh
- en
task_categories:
- text-generation
size_categories:
- 10K<n<100K
---
# BLOSSOM V6.3 SFT STAGE2
### Introduction
BLOSSOM V6.3 SFT Stage2 is a high-quality, diverse large language model fine-tuning dataset designed for the second-stage SFT training of the Blossom V6.3 model. Its purpose is to further enhance the model's ability to handle complex instructions on more rare real-world problems.
While open-source large language models often release model weights and technical reports, the most advanced open-source models typically withhold their pre-training and post-training data, making it difficult for the community to replicate their capabilities. Blossom is committed to providing researchers with reproducible post-training data for model capability development.
**Data Sources**: ShareGPT, WildChat, Wizard, Stackoverflow, Math, Magpie, InfinityPreference, Code, Flan, Olcc, Ruozhiba, etc.
**Synthesis Workflow Overview**:
Primarily employs three cost-effective models: Deepseek-V3.1, Gemini 2.5 Flash, and Qwen3-235B-A22B-Instruct-2507 (denoted as A, B, C)—to regenerate responses under different scenarios using tailored synthesis strategies.
For example:
- In objective scenarios like mathematics (where answers are unique), Model A first generates responses as a "teacher." If reference answers exist in the source data, Model B verifies the correctness of A's responses against them. If no reference answers exist, Model C generates a second response, and Model B checks consistency between A and C's outputs. Inconsistent responses are filtered out.
- For subjective scenarios, three models cross-evaluate each other. For instance, Models A and B generate responses to a question, and Model C evaluates which is better. The superior response may be retained as training data or used for preference data construction. To mitigate model bias, roles (respondent/evaluator) are randomly assigned to A, B, and C in each instance.
Additional rule-based filtering is applied, such as:
- N-Gram filtering to remove data with many repetitions.
- Discarding questions containing toxic content that triggers teacher model refusals.
Further technical details will be released in the future. The data is synthesized by the [🌸BlossomData](https://github.com/Azure99/BlossomData) framework.
### Languages
Primarily Chinese and English, with a roughly 1:1 ratio of Chinese-to-English data.
### Dataset Structure
Each entry represents a conversational sample with the following fields:
- `id`: Unique identifier combined with `metadata.source`.
- `type`: Always set to `chat`.
- `metadata`: Contains `source` indicating the data origin.
- `messages`: A list of dialogue messages. Each message includes `role` (`user` or `assistant`) and `content` (text).
### Limitations
This dataset is AI-generated. Despite preliminary validation and filtering, it may still contain inaccuracies or severe errors.
---
许可证: Apache-2.0
语言:
- 中文
- 英文
任务类别:
- 文本生成
数据规模:
- 1万 < 样本数 < 10万
---
# BLOSSOM V6.3 SFT STAGE2
### 简介
BLOSSOM V6.3 SFT 第二阶段数据集是专为 Blossom V6.3 模型的第二阶段监督微调(Supervised Fine-Tuning,SFT)训练打造的高质量、多样化大语言模型(Large Language Model,LLM)微调数据集。其核心目标是进一步提升模型针对真实场景中罕见复杂指令的处理能力。
尽管开源大语言模型通常会发布模型权重与技术报告,但顶尖开源模型往往不会公开其预训练与后训练数据,导致社区难以复现其模型性能。Blossom 项目致力于为研究者提供可复现的后训练数据,以推动模型能力的研发工作。
**数据来源**:ShareGPT、WildChat、Wizard、Stackoverflow、数学数据集(Math)、Magpie、InfinityPreference、代码数据集(Code)、Flan、Olcc、Ruozhiba 等。
**合成流程概述**:
本数据集主要采用三款高性价比模型——Deepseek-V3.1、Gemini 2.5 Flash 与 Qwen3-235B-A22B-Instruct-2507(分别记为 A、B、C),通过定制化合成策略在不同场景下重新生成模型回复。
示例如下:
- 针对数学等答案具有唯一性的客观场景:模型 A 先以“教师”身份生成回复。若源数据中存在参考答案,则模型 B 将基于参考答案校验 A 回复的正确性;若不存在参考答案,则模型 C 生成第二份回复,随后模型 B 验证 A 与 C 的输出是否一致,不一致的回复将被过滤。
- 针对主观场景,则由三款模型互相交叉评估。例如,模型 A 与 B 针对某一问题生成回复,由模型 C 评判哪份回复更优质。更优质的回复可被保留作为训练数据,或用于偏好数据的构建。为缓解模型偏见,每个样本中 A、B、C 的角色(应答者/评估者)均随机分配。
此外还应用了基于规则的额外过滤措施,具体包括:
- 采用 N-Gram 过滤以移除存在大量重复内容的数据;
- 丢弃包含触发教师模型拒绝生成的有害内容的问题。
更多技术细节将在未来公开。本数据集由 [🌸BlossomData](https://github.com/Azure99/BlossomData) 框架合成生成。
### 语言覆盖
本数据集主要覆盖中文与英文,中英数据占比大致为 1:1。
### 数据集结构
每条数据代表一个对话样本,包含以下字段:
- `id`:结合`metadata.source`生成的唯一标识符;
- `type`:固定为`chat`;
- `metadata`:包含`source`字段,用于标识数据来源;
- `messages`:对话消息列表。每条消息包含`role`(取值为`user`或`assistant`)与`content`(文本内容)。
### 局限性说明
本数据集由人工智能生成。尽管已进行初步验证与过滤,仍可能存在不准确之处或严重错误。
提供机构:
Azure99



