five

FuseChat-Mixture-OpenChat-3.5-7B-Representation

收藏
魔搭社区2025-12-05 更新2025-02-01 收录
下载链接:
https://modelscope.cn/datasets/FuseAI/FuseChat-Mixture-OpenChat-3.5-7B-Representation
下载链接
链接失效反馈
官方服务:
资源简介:
# Dataset Card for FuseChat-Mixture ## Dataset Description FuseChat-Mixture is the training dataset used in 📑[FuseChat: Knowledge Fusion of Chat Models](https://arxiv.org/abs/2402.16107) [FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) is a comprehensive training dataset covers different styles and capabilities, featuring both human-written and model-generated, and spanning general instruction-following and specific skills. These sources include: - [Orca-Best](https://huggingface.co/datasets/shahules786/orca-best): We sampled 20,000 examples from Orca-Best, which is filtered from the original GPT-4 (1M) partition of Orca based on maximum length and embedding clustering of instructions. - [Capybara](https://huggingface.co/datasets/LDJnr/Capybara): We incorporated all the 16,000 examples of Capybara, which is a high-quality collection of multi-turn synthetic conversations. - [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots): We included all the 9,500 examples of No-Robots, which is a dataset created by skilled human annotators for supervised fine-tuning. - [ShareGPT-GPT4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4): We utilized all 6,200 examples from ShareGPT-GPT4, which exclusively uses dialogues generated by GPT-4 in ShareGPT. - [Oasst-Top1](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25): We selected 5,000 examples from Oasst-Top1, which is a refined version of Oasst1, a human-annotated assistant-style conversation dataset. - [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA): We sampled 10,000 examples from MetaMathQA~\citep{yu2023metamath}, which is augmented from the GSM8K and MATH datasets for mathematics problem-solving. - [OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K): We chose 10,000 examples from OSS-Instruct, which contains code instruction data synthesized from open-source code snippets. - [Evol-Alpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1): We sampled 10,000 examples from Evol-Alpaca, which is a code instruction dataset generated by GPT-4 with evol-instruct proposed by WizardCoder. - [Python-Code](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT): We selected 10,000 examples from Python-Code, which comprises instructions and responses generated by GPT-3.5 and GPT-4 for python code generation. We followed the data processing code in [Vicuna](https://github.com/lm-sys/FastChat/tree/main/fastchat/data) to clean instances containing non-English or special characters. Then, we split long conversations into blocks with a maximum length of 2048 tokens, resulting in the final FuseChat Mixture with 95,000 examples. ## Citation If you find this work is relevant with your research or applications, please feel free to cite our work! ``` @article{wan2024fusechat, title={FuseChat: Knowledge Fusion of Chat Models}, author={Fanqi Wan and Ziyi Yang and Longguang Zhong and Xiaojun Quan and Xinting Huang and Wei Bi}, journal={arXiv preprint arXiv:2402.16107}, year={2024} } ```

# FuseChat-Mixture 数据集卡片 ## 数据集描述 FuseChat-Mixture 是论文📑[《FuseChat:聊天模型的知识融合》(FuseChat: Knowledge Fusion of Chat Models)](https://arxiv.org/abs/2402.16107) 中使用的训练数据集。 [FuseChat-Mixture](https://huggingface.co/datasets/FuseAI/FuseChat-Mixture) 是一套覆盖多种风格与能力的综合性训练数据集,兼具人工撰写与模型生成内容,涵盖通用指令遵循与专项技能场景,其数据来源包括: - [Orca-Best](https://huggingface.co/datasets/shahules786/orca-best):我们从Orca-Best中采样20000条样本。该数据集基于Orca原始GPT-4(100万条)分区,通过指令的最大长度与嵌入聚类进行筛选得到。 - [Capybara](https://huggingface.co/datasets/LDJnr/Capybara):我们完整引入Capybara的全部16000条样本,该数据集为高质量多轮合成对话集合。 - [No-Robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots):我们纳入No-Robots的全部9500条样本,该数据集由专业人工标注者构建,用于监督微调。 - [ShareGPT-GPT4](https://huggingface.co/datasets/shibing624/sharegpt_gpt4):我们使用ShareGPT-GPT4的全部6200条样本,该数据集仅包含ShareGPT平台上由GPT-4生成的对话内容。 - [Oasst-Top1](https://huggingface.co/datasets/OpenAssistant/oasst_top1_2023-08-25):我们从Oasst-Top1中选取5000条样本。该数据集是Oasst1的精炼版本,Oasst1为人工标注的助手风格对话数据集。 - [MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA):我们从MetaMathQA中采样10000条样本。该数据集基于GSM8K与MATH数据集扩充得到,用于数学问题求解citep{yu2023metamath}。 - [OSS-Instruct](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K):我们从OSS-Instruct中选取10000条样本,该数据集包含由开源代码片段合成的代码指令数据。 - [Evol-Alpaca](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1):我们从Evol-Alpaca中采样10000条样本。该数据集是由GPT-4基于WizardCoder提出的进化指令(evol-instruct)生成的代码指令数据集。 - [Python-Code](https://huggingface.co/datasets/ajibawa-2023/Python-Code-23k-ShareGPT):我们从Python-Code中选取10000条样本,该数据集包含由GPT-3.5与GPT-4生成的Python代码生成相关指令与回复内容。 我们参照[Vicuna](https://github.com/lm-sys/FastChat/tree/main/fastchat/data) 的数据处理代码,清洗掉包含非英语或特殊字符的样本。随后将长对话切割为最大长度为2048个词元(Token)的块,最终得到包含95000条样本的FuseChat-Mixture数据集。 ## 引用说明 若您的研究或应用与本工作相关,欢迎引用我们的成果! @article{wan2024fusechat, title={FuseChat: Knowledge Fusion of Chat Models}, author={Fanqi Wan and Ziyi Yang and Longguang Zhong and Xiaojun Quan and Xinting Huang and Wei Bi}, journal={arXiv preprint arXiv:2402.16107}, year={2024} }
提供机构:
maas
创建时间:
2025-01-27
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作