Azure99/blossom-v6.3-sft-stage1

Name: Azure99/blossom-v6.3-sft-stage1
Creator: Azure99
Published: 2025-12-06 19:25:00
License: No description provided

Source: Hugging Face. Updated 2025-12-06. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/Azure99/blossom-v6.3-sft-stage1


Resource description:

---
license: apache-2.0
language:
- zh
- en
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

# BLOSSOM V6.3 SFT STAGE1

### Introduction

BLOSSOM V6.3 SFT Stage1 is a high-quality, diverse fine-tuning dataset for large language models, designed for the first-stage SFT training of the Blossom V6.3 model. Its purpose is to give the model an initial alignment of its dialogue capabilities through exposure to large-scale synthetic data.

Open-source large language models often release model weights and technical reports, but the most advanced of them typically withhold their pre-training and post-training data, which makes it difficult for the community to replicate their capabilities. Blossom is committed to providing researchers with reproducible post-training data for model capability development.

**Data Sources**: WildChat, Wizard, Stackoverflow, Math, Magpie, InfinityPreference, Code, Flan, Olcc, Ruozhiba, etc.

**Synthesis Workflow Overview**: The workflow primarily employs three cost-effective models, Deepseek-V3.1, Gemini 2.5 Flash, and Qwen3-235B-A22B-Instruct-2507 (denoted as A, B, and C), to regenerate responses under different scenarios using tailored synthesis strategies. For example:

- In objective scenarios like mathematics (where answers are unique), Model A first generates responses as a "teacher." If reference answers exist in the source data, Model B verifies the correctness of A's responses against them. If no reference answers exist, Model C generates a second response, and Model B checks consistency between A's and C's outputs. Inconsistent responses are filtered out.
- For subjective scenarios, the three models cross-evaluate each other. For instance, Models A and B generate responses to a question, and Model C evaluates which is better. The superior response may be retained as training data or used for preference data construction. To mitigate model bias, roles (respondent/evaluator) are randomly assigned to A, B, and C in each instance.

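The objective-scenario flow and the random role assignment described above can be sketched roughly as follows. This is a minimal illustration and not the BlossomData implementation: the function names, the `Generator` and `Judge` callable signatures, and the choice to return `None` for filtered samples are all assumptions made for the example.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical interfaces (not from BlossomData):
# a generator maps a question to an answer; a judge returns True
# when two answers to the same question agree.
Generator = Callable[[str], str]
Judge = Callable[[str, str, str], bool]


def synthesize_objective(
    question: str,
    reference: Optional[str],
    model_a: Generator,
    model_b_judge: Judge,
    model_c: Generator,
) -> Optional[str]:
    """Return A's answer if it passes verification, else None (filtered out)."""
    answer_a = model_a(question)
    if reference is not None:
        # A reference answer exists: B checks A's answer against it.
        agreed = model_b_judge(question, answer_a, reference)
    else:
        # No reference: C answers independently, B checks A against C.
        answer_c = model_c(question)
        agreed = model_b_judge(question, answer_a, answer_c)
    return answer_a if agreed else None


def assign_roles(models: Sequence[str], rng: random.Random) -> dict:
    """Randomly pick two respondents and one evaluator from three models."""
    shuffled = list(models)
    rng.shuffle(shuffled)
    return {"respondents": shuffled[:2], "evaluator": shuffled[2]}
```

In practice the generator and judge callables would wrap calls to the respective model APIs. The judge's prompt and agreement criterion are not described in this card, so they are left abstract here.
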
Additional rule-based filtering is applied, such as:

- N-Gram filtering to remove data with many repetitions.
- Discarding questions containing toxic content that triggers teacher model refusals.

Further technical details will be released in the future.

The data is synthesized by the [🌸BlossomData](https://github.com/Azure99/BlossomData) framework.

### Languages

Primarily Chinese and English, with a roughly 1:1 ratio of Chinese-to-English data.

### Dataset Structure

Each entry represents a conversational sample with the following fields:

- `id`: Unique identifier combined with `metadata.source`.
- `type`: Always set to `chat`.
- `metadata`: Contains `source` indicating the data origin.
- `messages`: A list of dialogue messages. Each message includes `role` (`user` or `assistant`) and `content` (text).

### Limitations

This dataset is AI-generated. Despite preliminary validation and filtering, it may still contain inaccuracies or severe errors.
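As a rough illustration of the record layout under Dataset Structure and of the N-Gram repetition filter mentioned earlier, the sketch below checks a single record and computes a repetition score. The helper names, the requirement that `messages` be non-empty, the default `n=4`, and the scoring formula are assumptions for the example. The card does not state the actual filter parameters.

```python
from collections import Counter
from typing import Any, Sequence


def is_valid_record(record: dict[str, Any]) -> bool:
    """Check the fields listed under Dataset Structure."""
    if "id" not in record or record.get("type") != "chat":
        return False
    metadata = record.get("metadata")
    if not isinstance(metadata, dict) or "source" not in metadata:
        return False
    messages = record.get("messages")
    # Assumption: a usable sample has at least one message.
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in {"user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in messages
    )


def repeated_ngram_ratio(tokens: Sequence[str], n: int = 4) -> float:
    """Fraction of n-gram occurrences whose n-gram appears more than once.

    Tokenization is left to the caller; for mixed Chinese/English text,
    character-level tokens (list(text)) are one simple option.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)
```

Records could be obtained with the Hugging Face `datasets` library, for example via `load_dataset("Azure99/blossom-v6.3-sft-stage1")`. Split names are not stated in this card, so inspect the returned object before indexing into it. A filter would then drop samples whose repetition score exceeds a threshold chosen by the user.
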

Provider:

Azure99
