Azure99/blossom-v6.3-sft-stage1
收藏Hugging Face2025-12-06 更新2025-12-20 收录
下载链接:
https://hf-mirror.com/datasets/Azure99/blossom-v6.3-sft-stage1
下载链接
链接失效反馈官方服务:
资源简介:
---
license: apache-2.0
language:
- zh
- en
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---
# BLOSSOM V6.3 SFT STAGE1
### Introduction
BLOSSOM V6.3 SFT Stage1 is a high-quality, diverse large language model fine-tuning dataset designed for the first-stage SFT training of the Blossom V6.3 model. Its purpose is to help the model initially align dialogue capabilities through exposure to large-scale synthetic data.
While open-source large language models often release model weights and technical reports, the most advanced open-source models typically withhold their pre-training and post-training data, making it difficult for the community to replicate their capabilities. Blossom is committed to providing researchers with reproducible post-training data for model capability development.
**Data Sources**: WildChat, Wizard, Stackoverflow, Math, Magpie, InfinityPreference, Code, Flan, Olcc, Ruozhiba, etc.
**Synthesis Workflow Overview**:
Primarily employs three cost-effective models: Deepseek-V3.1, Gemini 2.5 Flash, and Qwen3-235B-A22B-Instruct-2507 (denoted as A, B, C)—to regenerate responses under different scenarios using tailored synthesis strategies.
For example:
- In objective scenarios like mathematics (where answers are unique), Model A first generates responses as a "teacher." If reference answers exist in the source data, Model B verifies the correctness of A's responses against them. If no reference answers exist, Model C generates a second response, and Model B checks consistency between A and C's outputs. Inconsistent responses are filtered out.
- For subjective scenarios, three models cross-evaluate each other. For instance, Models A and B generate responses to a question, and Model C evaluates which is better. The superior response may be retained as training data or used for preference data construction. To mitigate model bias, roles (respondent/evaluator) are randomly assigned to A, B, and C in each instance.
Additional rule-based filtering is applied, such as:
- N-Gram filtering to remove data with many repetitions.
- Discarding questions containing toxic content that triggers teacher model refusals.
Further technical details will be released in the future. The data is synthesized by the [🌸BlossomData](https://github.com/Azure99/BlossomData) framework.
### Languages
Primarily Chinese and English, with a roughly 1:1 ratio of Chinese-to-English data.
### Dataset Structure
Each entry represents a conversational sample with the following fields:
- `id`: Unique identifier combined with `metadata.source`.
- `type`: Always set to `chat`.
- `metadata`: Contains `source` indicating the data origin.
- `messages`: A list of dialogue messages. Each message includes `role` (`user` or `assistant`) and `content` (text).
### Limitations
This dataset is AI-generated. Despite preliminary validation and filtering, it may still contain inaccuracies or severe errors.
提供机构:
Azure99



