tulu-v1-sft-mixture
Source: ModelScope community. Updated: 2025-12-05. Indexed: 2025-05-31.
Download link:
https://modelscope.cn/datasets/allenai/tulu-v1-sft-mixture
Resource description:
# Dataset Card for Tulu Instruction Mix
**For a newer version, see [Tulu V2](https://huggingface.co/datasets/allenai/tulu-v2)**
This version, the human data mixture, consists of a mix of:
* [FLAN](https://github.com/google-research/FLAN/tree/main) (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1) (Apache 2.0)
* [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) (CC BY-SA 3.0)
* [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) (Apache 2.0 listed, no official repo found)
* [GPT4-Alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#data-release) (CC BY-NC 4.0)
* [Code-Alpaca](https://github.com/sahil280114/codealpaca) (CC BY-NC 4.0)
Each subset contributes only its training split; where a subset has no predefined splits, the entire subset is used.
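The split-selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' actual build script: the function names `select_examples` and `build_mixture`, the split key `"train"`, and the record layout (plain dicts with an added `"source"` field) are all assumptions made for the example.

```python
from typing import Dict, List


def select_examples(splits: Dict[str, List[dict]]) -> List[dict]:
    """Return the training split if one exists, otherwise all examples.

    `splits` maps a split name to its examples. The key "train" is an
    assumed naming convention, not something the dataset card specifies.
    """
    if "train" in splits:
        return list(splits["train"])
    # No training split: fall back to the whole subset.
    # Split names are sorted so the result is deterministic.
    merged: List[dict] = []
    for name in sorted(splits):
        merged.extend(splits[name])
    return merged


def build_mixture(subsets: Dict[str, Dict[str, List[dict]]]) -> List[dict]:
    """Concatenate the selected examples of every subset.

    Each example is tagged with a hypothetical "source" field so its
    origin stays visible in the mixed dataset.
    """
    mixture: List[dict] = []
    for source, splits in subsets.items():
        for example in select_examples(splits):
            mixture.append({"source": source, **example})
    return mixture
```

With toy inputs, a subset that has `train` and `test` splits would contribute only its `train` examples, while a subset stored as a single unsplit section would contribute everything.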
For more information, see the paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751).
### License
We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/) in respect of the content contained in the dataset.
Provider:
maas
Created:
2025-05-29



