
Azure99/blossom-v6.3-sft-stage1

Hugging Face, updated 2025-12-06, indexed 2025-12-20
Download link:
https://hf-mirror.com/datasets/Azure99/blossom-v6.3-sft-stage1
Resource description:
---
license: apache-2.0
language:
- zh
- en
task_categories:
- text-generation
size_categories:
- 100K<n<1M
---

# BLOSSOM V6.3 SFT STAGE1

### Introduction

BLOSSOM V6.3 SFT Stage1 is a high-quality, diverse large language model fine-tuning dataset designed for the first-stage SFT training of the Blossom V6.3 model. Its purpose is to help the model initially align dialogue capabilities through exposure to large-scale synthetic data.

While open-source large language models often release model weights and technical reports, the most advanced open-source models typically withhold their pre-training and post-training data, making it difficult for the community to replicate their capabilities. Blossom is committed to providing researchers with reproducible post-training data for model capability development.

**Data Sources**: WildChat, Wizard, Stackoverflow, Math, Magpie, InfinityPreference, Code, Flan, Olcc, Ruozhiba, etc.

**Synthesis Workflow Overview**: Primarily employs three cost-effective models, Deepseek-V3.1, Gemini 2.5 Flash, and Qwen3-235B-A22B-Instruct-2507 (denoted as A, B, and C), to regenerate responses under different scenarios using tailored synthesis strategies. For example:

- In objective scenarios like mathematics (where answers are unique), Model A first generates responses as a "teacher." If reference answers exist in the source data, Model B verifies the correctness of A's responses against them. If no reference answers exist, Model C generates a second response, and Model B checks consistency between A's and C's outputs. Inconsistent responses are filtered out.
- For subjective scenarios, three models cross-evaluate each other. For instance, Models A and B generate responses to a question, and Model C evaluates which is better. The superior response may be retained as training data or used for preference data construction. To mitigate model bias, roles (respondent/evaluator) are randomly assigned to A, B, and C in each instance.
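The two synthesis strategies described above can be sketched roughly as follows. This is an illustration only, not code from BlossomData: the `Participant` class, the function names, and the `is_consistent` callable (standing in for the verifying model) are all hypothetical, and the real pipeline calls hosted models rather than plain Python callables.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Participant:
    """Hypothetical wrapper around one of the three models (A, B, or C)."""
    name: str
    generate: Callable[[str], str]         # question -> response
    judge: Callable[[str, str, str], int]  # (question, r0, r1) -> index of the better response


def synthesize_objective(
    question: str,
    teacher: Callable[[str], str],
    second: Callable[[str], str],
    is_consistent: Callable[[str, str], bool],
    reference: Optional[str] = None,
) -> Optional[str]:
    """Keep the teacher's answer only if it agrees with a reference or a second model."""
    answer = teacher(question)
    # With a reference answer, verify against it; otherwise ask a second model.
    target = reference if reference is not None else second(question)
    # Inconsistent answers are filtered out (returned as None).
    return answer if is_consistent(answer, target) else None


def synthesize_subjective(
    question: str,
    participants: list[Participant],
    rng: random.Random,
) -> dict:
    """Randomly assign two respondents and one evaluator, then record the preference."""
    # Sampling all three without replacement yields a random role assignment.
    first, second, evaluator = rng.sample(participants, 3)
    responses = [first.generate(question), second.generate(question)]
    better = evaluator.judge(question, responses[0], responses[1])
    return {
        "respondents": [first.name, second.name],
        "evaluator": evaluator.name,
        "chosen": responses[better],
        "rejected": responses[1 - better],
    }
```

The `chosen` response corresponds to what may be retained as SFT data, and the `chosen`/`rejected` pair to what could feed preference data construction. Passing a seeded `random.Random` makes the role assignment reproducible.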
Additional rule-based filtering is applied, such as:

- N-Gram filtering to remove data with many repetitions.
- Discarding questions containing toxic content that triggers teacher model refusals.

Further technical details will be released in the future. The data is synthesized by the [🌸BlossomData](https://github.com/Azure99/BlossomData) framework.

### Languages

Primarily Chinese and English, with a roughly 1:1 ratio of Chinese-to-English data.

### Dataset Structure

Each entry represents a conversational sample with the following fields:

- `id`: Unique identifier combined with `metadata.source`.
- `type`: Always set to `chat`.
- `metadata`: Contains `source` indicating the data origin.
- `messages`: A list of dialogue messages. Each message includes `role` (`user` or `assistant`) and `content` (text).

### Limitations

This dataset is AI-generated. Despite preliminary validation and filtering, it may still contain inaccuracies or severe errors.
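As a rough illustration of the fields listed under Dataset Structure and of the N-gram repetition rule mentioned earlier, the sketch below checks one record. None of it is taken from BlossomData: the function names, the n-gram size of 4, and the 0.3 threshold are assumptions chosen for the example, and the actual filter settings are not stated in the dataset card.

```python
from collections import Counter

ALLOWED_ROLES = {"user", "assistant"}


def has_expected_structure(record: dict) -> bool:
    """Check that a record carries the documented fields: id, type, metadata, messages."""
    if "id" not in record or record.get("type") != "chat":
        return False
    metadata = record.get("metadata")
    if not isinstance(metadata, dict) or "source" not in metadata:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in ALLOWED_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the same text.

    Whitespace splitting is a simplification; Chinese text would need
    character-level or tokenizer-based n-grams instead.
    """
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def passes_repetition_filter(record: dict, n: int = 4, max_ratio: float = 0.3) -> bool:
    """Reject records whose assistant messages are dominated by repeated n-grams."""
    return all(
        repeated_ngram_ratio(m["content"], n) <= max_ratio
        for m in record["messages"]
        if m["role"] == "assistant"
    )
```

A record would typically be run through `has_expected_structure` first, since `passes_repetition_filter` assumes the `messages` list is well formed.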
Provided by:
Azure99