FuseChat-Mixture

Name: FuseChat-Mixture
Creator: maas
Published: 2025-12-05 16:21:49
License: 暂无描述

魔搭社区2025-12-05 更新2025-02-01 收录

下载链接：

https://modelscope.cn/datasets/FuseAI/FuseChat-Mixture

下载链接

链接失效反馈

官方服务：

资源简介：

# Dataset Card for FuseChat-Mixture ## Dataset Description FuseChat-Mixture is the training dataset used in 📑[FuseChat: Knowledge Fusion of Chat Models](https://arxiv.org/abs/2402.16107) [FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) is a comprehensive training dataset covers different styles and capabilities, featuring both human-written and model-generated, and spanning general instruction-following and specific skills. These sources include: - [Orca-Best](https://huggingface.co/datasets/shahules786/orca-best): We sampled 20,000 examples from Orca-Best, which is filtered from the original GPT-4 (1M) partition of Orca based on maximum length and embedding clustering of instructions. - [Capybara](https://huggingface.co/datasets/LDJnr/Capybara): We incorporated all the 16,000 examples of Capybara, which is a high-quality collection of multi-turn synthetic conversations. - [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots): We included all the 9,500 examples of No-Robots, which is a dataset created by skilled human annotators for supervised fine-tuning. - [ShareGPT-GPT4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4): We utilized all 6,200 examples from ShareGPT-GPT4, which exclusively uses dialogues generated by GPT-4 in ShareGPT. - [Oasst-Top1](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25): We selected 5,000 examples from Oasst-Top1, which is a refined version of Oasst1, a human-annotated assistant-style conversation dataset. - [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA): We sampled 10,000 examples from MetaMathQA~\citep{yu2023metamath}, which is augmented from the GSM8K and MATH datasets for mathematics problem-solving. - [OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K): We chose 10,000 examples from OSS-Instruct, which contains code instruction data synthesized from open-source code snippets. - [Evol-Alpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1): We sampled 10,000 examples from Evol-Alpaca, which is a code instruction dataset generated by GPT-4 with evol-instruct proposed by WizardCoder. - [Python-Code](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT): We selected 10,000 examples from Python-Code, which comprises instructions and responses generated by GPT-3.5 and GPT-4 for python code generation. We followed the data processing code in [Vicuna](https://github.com/lm-sys/FastChat/tree/main/fastchat/data) to clean instances containing non-English or special characters. Then, we split long conversations into blocks with a maximum length of 2048 tokens, resulting in the final FuseChat Mixture with 95,000 examples. ## Citation If you find this work is relevant with your research or applications, please feel free to cite our work! ``` @article{wan2024fusechat, title={FuseChat: Knowledge Fusion of Chat Models}, author={Fanqi Wan and Ziyi Yang and Longguang Zhong and Xiaojun Quan and Xinting Huang and Wei Bi}, journal={arXiv preprint arXiv:2402.16107}, year={2024} } ```

# FuseChat-Mixture 数据集卡片 ## 数据集描述 FuseChat-Mixture 是论文📑[《FuseChat：对话模型的知识融合》](https://arxiv.org/abs/2402.16107)中使用的训练数据集。 [FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) 是一套覆盖多样风格与能力的综合训练数据集，既包含人类撰写内容，也包含模型生成内容，涵盖通用指令遵循与特定技能任务，其来源包括： - [Orca-Best](https://huggingface.co/datasets/shahules786/orca-best)：我们从Orca-Best中采样20000条样本，该数据集基于Orca原始GPT-4分区（100万条），通过指令的最大长度与嵌入聚类进行筛选得到。 - [Capybara](https://huggingface.co/datasets/LDJnr/Capybara)：我们采用Capybara的全部16000条样本，该数据集为高质量多轮合成对话集合。 - [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots)：我们纳入No-Robots的全部9500条样本，该数据集由专业人类标注者构建，用于监督微调任务。 - [ShareGPT-GPT4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4)：我们使用ShareGPT-GPT4的全部6200条样本，该数据集仅包含ShareGPT平台上由GPT-4生成的对话内容。 - [Oasst-Top1](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25)：我们从Oasst-Top1中选取5000条样本，该数据集为Oasst1的精炼版本，Oasst1是由人类标注的助手风格对话数据集。 - [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA)：我们从MetaMathQA中采样10000条样本，该数据集基于GSM8K与MATH数据集扩充而来，用于数学问题求解任务~citep{yu2023metamath}。 - [OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K)：我们从OSS-Instruct中选取10000条样本，该数据集包含从开源代码片段合成的代码指令数据。 - [Evol-Alpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1)：我们从Evol-Alpaca中采样10000条样本，该数据集为GPT-4基于WizardCoder提出的进化指令（evol-instruct）生成的代码指令数据集。 - [Python-Code](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT)：我们从Python-Code中选取10000条样本，该数据集包含由GPT-3.5与GPT-4生成的用于Python代码生成的指令与回复内容。我们遵循[Vicuna](https://github.com/lm-sys/FastChat/tree/main/fastchat/data)中的数据处理代码，清洗掉包含非英语或特殊字符的样本。随后，我们将长对话拆分为最大长度为2048 Token的块，最终得到包含95000条样本的FuseChat混合数据集。 ## 引用若您的研究或应用与本工作相关，请引用我们的成果！ @article{wan2024fusechat, title={FuseChat: Knowledge Fusion of Chat Models}, author={Fanqi Wan and Ziyi Yang and Longguang Zhong and Xiaojun Quan and Xinting Huang and Wei Bi}, journal={arXiv preprint arXiv:2402.16107}, year={2024} }

提供机构：

maas

创建时间：

2025-01-27

5,000+

优质数据集

54 个

任务类型

进入经典数据集