five

FuseChat-Mixture

收藏
魔搭社区2025-12-05 更新2025-02-01 收录
下载链接:
https://modelscope.cn/datasets/FuseAI/FuseChat-Mixture
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for FuseChat-Mixture ## Dataset Description FuseChat-Mixture is the training dataset used in 📑[FuseChat: Knowledge Fusion of Chat Models](https://arxiv.org/abs/2402.16107) [FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) is a comprehensive training dataset covers different styles and capabilities, featuring both human-written and model-generated, and spanning general instruction-following and specific skills. These sources include: - [Orca-Best](https://huggingface.co/datasets/shahules786/orca-best): We sampled 20,000 examples from Orca-Best, which is filtered from the original GPT-4 (1M) partition of Orca based on maximum length and embedding clustering of instructions. - [Capybara](https://huggingface.co/datasets/LDJnr/Capybara): We incorporated all the 16,000 examples of Capybara, which is a high-quality collection of multi-turn synthetic conversations. - [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots): We included all the 9,500 examples of No-Robots, which is a dataset created by skilled human annotators for supervised fine-tuning. - [ShareGPT-GPT4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4): We utilized all 6,200 examples from ShareGPT-GPT4, which exclusively uses dialogues generated by GPT-4 in ShareGPT. - [Oasst-Top1](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25): We selected 5,000 examples from Oasst-Top1, which is a refined version of Oasst1, a human-annotated assistant-style conversation dataset. - [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA): We sampled 10,000 examples from MetaMathQA~\citep{yu2023metamath}, which is augmented from the GSM8K and MATH datasets for mathematics problem-solving. - [OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K): We chose 10,000 examples from OSS-Instruct, which contains code instruction data synthesized from open-source code snippets. - [Evol-Alpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1): We sampled 10,000 examples from Evol-Alpaca, which is a code instruction dataset generated by GPT-4 with evol-instruct proposed by WizardCoder. - [Python-Code](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT): We selected 10,000 examples from Python-Code, which comprises instructions and responses generated by GPT-3.5 and GPT-4 for python code generation. We followed the data processing code in [Vicuna](https://github.com/lm-sys/FastChat/tree/main/fastchat/data) to clean instances containing non-English or special characters. Then, we split long conversations into blocks with a maximum length of 2048 tokens, resulting in the final FuseChat Mixture with 95,000 examples. ## Citation If you find this work is relevant with your research or applications, please feel free to cite our work! ``` @article{wan2024fusechat, title={FuseChat: Knowledge Fusion of Chat Models}, author={Fanqi Wan and Ziyi Yang and Longguang Zhong and Xiaojun Quan and Xinting Huang and Wei Bi}, journal={arXiv preprint arXiv:2402.16107}, year={2024} } ```

# FuseChat-Mixture 数据集卡片 ## 数据集描述 FuseChat-Mixture 是论文📑[《FuseChat:对话模型的知识融合》](https://arxiv.org/abs/2402.16107)中使用的训练数据集。 [FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) 是一套覆盖多样风格与能力的综合训练数据集,既包含人类撰写内容,也包含模型生成内容,涵盖通用指令遵循与特定技能任务,其来源包括: - [Orca-Best](https://huggingface.co/datasets/shahules786/orca-best):我们从Orca-Best中采样20000条样本,该数据集基于Orca原始GPT-4分区(100万条),通过指令的最大长度与嵌入聚类进行筛选得到。 - [Capybara](https://huggingface.co/datasets/LDJnr/Capybara):我们采用Capybara的全部16000条样本,该数据集为高质量多轮合成对话集合。 - [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots):我们纳入No-Robots的全部9500条样本,该数据集由专业人类标注者构建,用于监督微调任务。 - [ShareGPT-GPT4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4):我们使用ShareGPT-GPT4的全部6200条样本,该数据集仅包含ShareGPT平台上由GPT-4生成的对话内容。 - [Oasst-Top1](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25):我们从Oasst-Top1中选取5000条样本,该数据集为Oasst1的精炼版本,Oasst1是由人类标注的助手风格对话数据集。 - [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA):我们从MetaMathQA中采样10000条样本,该数据集基于GSM8K与MATH数据集扩充而来,用于数学问题求解任务~citep{yu2023metamath}。 - [OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K):我们从OSS-Instruct中选取10000条样本,该数据集包含从开源代码片段合成的代码指令数据。 - [Evol-Alpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1):我们从Evol-Alpaca中采样10000条样本,该数据集为GPT-4基于WizardCoder提出的进化指令(evol-instruct)生成的代码指令数据集。 - [Python-Code](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT):我们从Python-Code中选取10000条样本,该数据集包含由GPT-3.5与GPT-4生成的用于Python代码生成的指令与回复内容。 我们遵循[Vicuna](https://github.com/lm-sys/FastChat/tree/main/fastchat/data)中的数据处理代码,清洗掉包含非英语或特殊字符的样本。随后,我们将长对话拆分为最大长度为2048 Token的块,最终得到包含95000条样本的FuseChat混合数据集。 ## 引用 若您的研究或应用与本工作相关,请引用我们的成果! @article{wan2024fusechat, title={FuseChat: Knowledge Fusion of Chat Models}, author={Fanqi Wan and Ziyi Yang and Longguang Zhong and Xiaojun Quan and Xinting Huang and Wei Bi}, journal={arXiv preprint arXiv:2402.16107}, year={2024} }
提供机构:
maas
创建时间:
2025-01-27
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作