
tulu-v1-sft-mixture

ModelScope Community · Updated 2025-12-05 · Indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/allenai/tulu-v1-sft-mixture
Resource description:
# Dataset Card for Tulu Instruction Mix

**For a newer version, see [Tulu V2](https://huggingface.co/datasets/allenai/tulu-v2)**

This version, the human data mixture, consists of a mix of:

* [FLAN](https://github.com/google-research/FLAN/tree/main) (Apache 2.0): FLAN v2 with CoT examples (most of the tasks in SuperNatural Instructions are included here)
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1) (Apache 2.0)
* [Dolly](https://huggingface.co/datasets/databricks/databricks-dolly-15k) (CC BY-SA 3.0)
* [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) (Apache 2.0 listed, no official repo found)
* [GPT4-Alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#data-release) (CC BY-NC 4.0)
* [Code-Alpaca](https://github.com/sahil280114/codealpaca) (CC BY-NC 4.0)

For each subset, only the training split is used; if a subset has no splits, the entire subset is used.

For more information, see the paper [How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources](https://arxiv.org/abs/2306.04751).

### License

We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this dataset, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/) in respect of the content contained in the dataset.
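
The split-selection rule described in the card (take the training split of each subset, or the whole subset when it has no splits) can be sketched as below. This is a minimal illustration only, not the authors' actual build script: the function names `select_examples` and `build_mixture`, the `"dataset"` tag, and the in-memory dictionary layout are all hypothetical, and "no splits" is approximated here as "no split named `train`".

```python
from typing import Dict, List

Example = Dict[str, str]


def select_examples(splits: Dict[str, List[Example]]) -> List[Example]:
    """Return the 'train' split if it exists, otherwise every example.

    Assumption: a subset "without splits" is modelled as a mapping
    that has no key named 'train'.
    """
    if "train" in splits:
        return list(splits["train"])
    merged: List[Example] = []
    for name in sorted(splits):  # sorted for a deterministic order
        merged.extend(splits[name])
    return merged


def build_mixture(sources: Dict[str, Dict[str, List[Example]]]) -> List[Example]:
    """Concatenate the selected examples of each source subset.

    Each example is tagged with a hypothetical 'dataset' field so its
    origin (and therefore its license) stays traceable.
    """
    mixture: List[Example] = []
    for source_name, splits in sources.items():
        for example in select_examples(splits):
            mixture.append({"dataset": source_name, **example})
    return mixture


if __name__ == "__main__":
    # Toy stand-ins for the real subsets; the contents are made up.
    toy_sources = {
        "with_splits": {
            "train": [{"text": "kept"}],
            "test": [{"text": "dropped"}],
        },
        "without_splits": {"all": [{"text": "kept too"}]},
    }
    for row in build_mixture(toy_sources):
        print(row)
```

Tagging every example with its source is a design choice worth keeping in a real pipeline, since the subsets listed above carry different licenses.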

Provider:
maas
Created:
2025-05-29