five

DrewJin0827/MBD-ComposedSFT

收藏
Hugging Face2026-01-22 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/DrewJin0827/MBD-ComposedSFT
下载链接
链接失效反馈
官方服务:
资源简介:
--- language: - en license: apache-2.0 task_categories: - text-generation tags: - instruction-tuning - supervised-fine-tuning size_categories: - 1M<n<10M --- # MBD-ComposedSFT ## Dataset Summary This dataset is a composed supervised fine-tuning (SFT) dataset that combines multiple high-quality instruction-following datasets. It comprises approximately 3,679,149 instruction tuning samples covering diverse domains including mathematics, coding, and general instruction following. ## Composition & Sources The dataset is constructed from the following sources: - smoltalk: 1,043,917 samples - OpenMathInstruct 2 1M: 999,117 samples - tulu 3 sft mixture: 937,730 samples - opc sft stage2: 436,347 samples - MathInstruct: 262,038 samples ## Data Format Each record follows this structure: ```json { "query": "instruction or question text", "response": "response or answer text" } ``` ## Citation If you use this dataset, please cite the original source datasets: - MathInstruct - OpenMathInstruct-2 - Tulu 3 - SmolTalk - OPC-SFT ## License This dataset is released under the Apache 2.0 license. ## Acknowledgement This dataset is derived from a thorough cleaning and re-processing of the ReFusion dataset. We acknowledge the creators and contributors of ReFusion for their foundational work, on which this dataset is based.
提供机构:
DrewJin0827
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作