tulu-v1-sft-mixture

Name: tulu-v1-sft-mixture
Creator: maas
Published: 2025-12-05 16:36:46
License: no description provided

ModelScope community. Updated: 2025-12-05. Indexed: 2025-05-31.

Download link:

https://modelscope.cn/datasets/allenai/tulu-v1-sft-mixture

Resource overview:

# Dataset Card for Tulu Instruction Mix

**For a newer version, see [Tulu V2](https://huggingface.co/datasets/allenai/tulu-v2)**

This version, the human data mixture, consists of a mix of:

* [FLAN](https://github.com/google-research/FLAN/tree/main) (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1) (Apache 2.0)
* [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) (CC BY-SA 3.0)
* [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) (Apache 2.0 listed, no official repo found)
* [GPT4-Alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#data-release) (CC BY-NC 4.0)
* [Code-Alpaca](https://github.com/sahil280114/codealpaca) (CC BY-NC 4.0)

Each subset contributes only its training split, or its entire contents if it has no splits.

For more information, see the paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751).

### License

We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this dataset, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/) in respect of the content contained in the dataset.
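
The card's rule for assembling the mixture (take a subset's training split when one exists, otherwise take the whole subset) can be illustrated with a minimal sketch. The function name `select_examples`, the dictionary layout, and the sample records are hypothetical and chosen for illustration only; this is not the authors' actual preprocessing code.

```python
from typing import Dict, List

Example = dict  # hypothetical record type, for illustration only


def select_examples(splits: Dict[str, List[Example]]) -> List[Example]:
    """Return the 'train' split if present; otherwise return every record.

    `splits` maps a split name to its records. A subset that ships without
    splits is modeled here as a dict with a single arbitrary key.
    """
    if "train" in splits:
        return list(splits["train"])
    merged: List[Example] = []
    for name in sorted(splits):  # deterministic order across runs
        merged.extend(splits[name])
    return merged


# Hypothetical subsets: one with explicit splits, one without.
with_splits = {"train": [{"id": 1}, {"id": 2}], "test": [{"id": 3}]}
no_splits = {"all": [{"id": 4}, {"id": 5}]}

mixture = select_examples(with_splits) + select_examples(no_splits)
print([ex["id"] for ex in mixture])
```

Under this sketch the held-out `test` record is excluded from the mixture, while the unsplit subset is included in full.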

Provider:

maas

Created:

2025-05-29
