FuseChat-Mixture
收藏魔搭社区2025-12-05 更新2025-02-01 收录
下载链接:
https://modelscope.cn/datasets/FuseAI/FuseChat-Mixture
下载链接
链接失效反馈官方服务:
资源简介:
# Dataset Card for FuseChat-Mixture
## Dataset Description
FuseChat-Mixture is the training dataset used in 📑[FuseChat: Knowledge Fusion of Chat Models](https://arxiv.org/abs/2402.16107)
[FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) is a comprehensive training dataset covers different styles and capabilities, featuring both human-written and model-generated, and spanning general instruction-following and specific skills. These sources include:
- [Orca-Best](https://huggingface.co/datasets/shahules786/orca-best): We sampled 20,000 examples from Orca-Best, which is filtered from the original GPT-4 (1M) partition of Orca based on maximum length and embedding clustering of instructions.
- [Capybara](https://huggingface.co/datasets/LDJnr/Capybara): We incorporated all the 16,000 examples of Capybara, which is a high-quality collection of multi-turn synthetic conversations.
- [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots): We included all the 9,500 examples of No-Robots, which is a dataset created by skilled human annotators for supervised fine-tuning.
- [ShareGPT-GPT4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4): We utilized all 6,200 examples from ShareGPT-GPT4, which exclusively uses dialogues generated by GPT-4 in ShareGPT.
- [Oasst-Top1](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25): We selected 5,000 examples from Oasst-Top1, which is a refined version of Oasst1, a human-annotated assistant-style conversation dataset.
- [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA): We sampled 10,000 examples from MetaMathQA~\citep{yu2023metamath}, which is augmented from the GSM8K and MATH datasets for mathematics problem-solving.
- [OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K): We chose 10,000 examples from OSS-Instruct, which contains code instruction data synthesized from open-source code snippets.
- [Evol-Alpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1): We sampled 10,000 examples from Evol-Alpaca, which is a code instruction dataset generated by GPT-4 with evol-instruct proposed by WizardCoder.
- [Python-Code](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT): We selected 10,000 examples from Python-Code, which comprises instructions and responses generated by GPT-3.5 and GPT-4 for python code generation.
We followed the data processing code in [Vicuna](https://github.com/lm-sys/FastChat/tree/main/fastchat/data) to clean instances containing non-English or special characters. Then, we split long conversations into blocks with a maximum length of 2048 tokens, resulting in the final FuseChat Mixture with 95,000 examples.
## Citation
If you find this work is relevant with your research or applications, please feel free to cite our work!
```
@article{wan2024fusechat,
title={FuseChat: Knowledge Fusion of Chat Models},
author={Fanqi Wan and Ziyi Yang and Longguang Zhong and Xiaojun Quan and Xinting Huang and Wei Bi},
journal={arXiv preprint arXiv:2402.16107},
year={2024}
}
```
# FuseChat-Mixture 数据集卡片
## 数据集描述
FuseChat-Mixture 是论文📑[《FuseChat:对话模型的知识融合》](https://arxiv.org/abs/2402.16107)中使用的训练数据集。
[FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) 是一套覆盖多样风格与能力的综合训练数据集,既包含人类撰写内容,也包含模型生成内容,涵盖通用指令遵循与特定技能任务,其来源包括:
- [Orca-Best](https://huggingface.co/datasets/shahules786/orca-best):我们从Orca-Best中采样20000条样本,该数据集基于Orca原始GPT-4分区(100万条),通过指令的最大长度与嵌入聚类进行筛选得到。
- [Capybara](https://huggingface.co/datasets/LDJnr/Capybara):我们采用Capybara的全部16000条样本,该数据集为高质量多轮合成对话集合。
- [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots):我们纳入No-Robots的全部9500条样本,该数据集由专业人类标注者构建,用于监督微调任务。
- [ShareGPT-GPT4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4):我们使用ShareGPT-GPT4的全部6200条样本,该数据集仅包含ShareGPT平台上由GPT-4生成的对话内容。
- [Oasst-Top1](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25):我们从Oasst-Top1中选取5000条样本,该数据集为Oasst1的精炼版本,Oasst1是由人类标注的助手风格对话数据集。
- [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA):我们从MetaMathQA中采样10000条样本,该数据集基于GSM8K与MATH数据集扩充而来,用于数学问题求解任务~citep{yu2023metamath}。
- [OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K):我们从OSS-Instruct中选取10000条样本,该数据集包含从开源代码片段合成的代码指令数据。
- [Evol-Alpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1):我们从Evol-Alpaca中采样10000条样本,该数据集为GPT-4基于WizardCoder提出的进化指令(evol-instruct)生成的代码指令数据集。
- [Python-Code](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT):我们从Python-Code中选取10000条样本,该数据集包含由GPT-3.5与GPT-4生成的用于Python代码生成的指令与回复内容。
我们遵循[Vicuna](https://github.com/lm-sys/FastChat/tree/main/fastchat/data)中的数据处理代码,清洗掉包含非英语或特殊字符的样本。随后,我们将长对话拆分为最大长度为2048 Token的块,最终得到包含95000条样本的FuseChat混合数据集。
## 引用
若您的研究或应用与本工作相关,请引用我们的成果!
@article{wan2024fusechat,
title={FuseChat: Knowledge Fusion of Chat Models},
author={Fanqi Wan and Ziyi Yang and Longguang Zhong and Xiaojun Quan and Xinting Huang and Wei Bi},
journal={arXiv preprint arXiv:2402.16107},
year={2024}
}
提供机构:
maas
创建时间:
2025-01-27



