tulu-v2-sft-mixture
收藏魔搭社区2026-05-23 更新2024-05-15 收录
下载链接:
https://modelscope.cn/datasets/AI-ModelScope/tulu-v2-sft-mixture
下载链接
链接失效反馈官方服务:
资源简介:
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu-v2/Tulu%20V2%20banner.png" alt="TuluV2 banner" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
# Dataset Card for Tulu V2 Mix
*Note the [ODC-BY license](https://opendatacommons.org/licenses/by/1-0/), indicating that different licenses apply to subsets of the data. This means that some portions of the dataset are non-commercial. We present the mixture as a research artifact.*
Tulu is a series of language models that are trained to act as helpful assistants.
The dataset consists of a mix of :
* [FLAN](https://github.com/google-research/FLAN/tree/main) (Apache 2.0): We use 50,000 examples sampled from FLAN v2. To emphasize CoT-style reasoning, we sample another 50,000 examples from the CoT
subset of the FLAN v2 mixture.
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1) (Apache 2.0): We isolate the highest-scoring paths in each conversation tree and use these samples, resulting in 7,708 examples.
* [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) (Apache 2.0 listed, no official repo found): We use all 114,046 from our processed ShareGPT dataset, as we found ShareGPT gave strong performance in prior work.
* [GPT4-Alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#data-release) (CC By NC 4.0):We sample 20,000 samples from GPT-4 Alpaca to further include distilled GPT-4 data.
* [Code-Alpaca](https://github.com/sahil280114/codealpaca) (CC By NC 4.0):We use all 20,022 examples from Code Alpaca, following our prior V1 mixture, in order to improve model code abilities.
* [LIMA](https://huggingface.co/datasets/GAIR/lima) (CC BY-NC-SA): We use 1,030 examples from LIMA as an example of carefully curated data.
* [WizardLM Evol Instruct](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k) (No license provided): We subsample 30,000 examples from WizardLM, which contains distilled data of increasing diversity and complexity.
* [Open-Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca) (MIT): We sample 30,000 samples generated by GPT-4 from OpenOrca, a reproduction of Orca Mukherjee et al., 2023, which augments FLAN data with additional model-generated explanations
* Hardcoded: A collection of prompts such as `Tell me about yourself' with 140 total samples manually written by the authors, such that the model generates correct outputs given inquiries about its name or developers.
* Science: 7,544 examples from a mixture of scientific document understand tasks—including question answering, fact-checking, summarization, and information extraction (under development, standalone release soon).
These are made by taking either just the training set of the subsets or the entire section if no splits are present.
Tulu V2 is presented as a singular training split.
[Tulu V2 DPO 70B](https://huggingface.co/allenai/tulu-2-dpo-70b), and is a fine-tuned version of Llama 2 that was trained on on a mix of publicly available, synthetic and human datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290).
**Model Family:** Other models and the dataset are found in the [Tulu V2 collection](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101).
The length distribution of the dataset can be seen below:
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu-v2/length_histogram_v2.png" alt="TuluV2 histogram" width="600" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
Tulu V1 Mix can be found [here](https://huggingface.co/datasets/allenai/tulu-v1).
### Science data note
The included science data is from the following categories:
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu-v2/science_data.png" alt="TuluV2 science data mix" width="600" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
Note that some of the examples include an off-by-one error in the sentence indexing that had a small or negligible impact on performance.
This was found during testing and will be updated in future versions, with the detailed release of the dataset artifact itself coming in a future release.
### License
We are releasing this dataset under the terms of [ODC-BY](https://opendatacommons.org/licenses/by/1-0/). By using this, you are also bound by the [Common Crawl terms of use](https://commoncrawl.org/terms-of-use/) in respect of the content contained in the dataset.
# Tulu V2 Mix 数据集卡片
*请注意本数据集采用[ODC-BY许可](https://opendatacommons.org/licenses/by/1-0/),这意味着数据集的不同子集适用不同的许可协议,部分数据子集仅可用于非商业用途。本数据集仅作为研究成果发布。
Tulu是一系列被训练为实用助手的语言模型系列。本数据集由以下混合数据组成:
* [FLAN](https://github.com/google-research/FLAN/tree/main)(Apache 2.0许可):我们从FLAN v2中采样50000条样本。为了强化思维链(Chain of Thought, CoT)风格的推理能力,我们额外从FLAN v2混合数据集的思维链子集中采样了50000条样本。
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1)(Apache 2.0许可):我们提取每个对话树中得分最高的路径作为样本,最终得到7708条样本。
* [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)(标注为Apache 2.0许可,但未找到官方仓库):我们使用经过处理的ShareGPT数据集中的全部114046条样本,过往研究表明ShareGPT数据集能带来出色的模型性能表现。
* [GPT4-Alpaca](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#data-release)(CC BY NC 4.0许可):我们从GPT-4 Alpaca中采样20000条样本,以纳入经GPT-4蒸馏得到的数据。
* [Code-Alpaca](https://github.com/sahil280114/codealpaca)(CC BY NC 4.0许可):我们沿用V1版本混合数据集的设置,使用Code Alpaca的全部20022条样本,以提升模型的代码生成能力。
* [LIMA](https://huggingface.co/datasets/GAIR/lima)(CC BY-NC-SA许可):我们使用LIMA中的1030条样本,该数据集属于精心筛选(curated)的数据的典型代表。
* [WizardLM Evol Instruct](https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k)(未提供许可协议):我们从WizardLM数据集中采样30000条样本,该数据集包含多样性与复杂度逐步提升的经蒸馏得到的数据。
* [Open-Orca](https://huggingface.co/datasets/Open-Orca/OpenOrca)(MIT许可):我们从Open-Orca中采样30000条由GPT-4生成的样本,该数据集复刻了Orca(Mukherjee等人,2023)的研究成果,通过额外的模型生成解释来扩充FLAN数据集。
* 硬编码提示(Hardcoded):由作者手动编写的140条提示词集合,例如`Tell me about yourself`,用于确保模型在被问及自身名称或开发者信息时能生成正确的回复。
* 科学任务数据集:包含7544条来自混合科学文档理解任务的样本,涵盖问答、事实核查、摘要与信息提取(处于开发阶段,独立版本即将发布)。
我们要么使用各子集的训练集,要么在无划分的情况下使用全部数据。Tulu V2仅提供单一训练划分版本。[Tulu V2 DPO 70B](https://huggingface.co/allenai/tulu-2-dpo-70b)是Llama 2的微调版本,基于公开可用的合成数据与人类标注数据集,通过[直接偏好优化(Direct Preference Optimization, DPO)](https://arxiv.org/abs/2305.18290)训练得到。
**模型家族**:本系列的其他模型与数据集可在[Tulu V2数据集合集](https://huggingface.co/collections/allenai/tulu-v2-suite-6551b56e743e6349aab45101)中获取。
数据集的长度分布如下图所示:
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu-v2/length_histogram_v2.png" alt="TuluV2长度直方图" width="600" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
Tulu V1 Mix数据集可通过[此处](https://huggingface.co/datasets/allenai/tulu-v1)获取。
### 科学数据说明
本次发布的科学数据包含以下类别:
<img src="https://huggingface.co/datasets/allenai/blog-images/resolve/main/tulu-v2/science_data.png" alt="TuluV2科学数据混合分布" width="600" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
请注意部分样本存在句子索引偏移错误,但该错误对模型性能的影响极小甚至可以忽略。我们在测试阶段发现了该问题,并将在后续版本中修复;数据集工件的详细发布也将在未来的版本中推出。
### 许可协议
本数据集采用[ODC-BY许可](https://opendatacommons.org/licenses/by/1-0/)进行发布。使用本数据集的同时,您还需遵守[Common Crawl使用条款](https://commoncrawl.org/terms-of-use/),以合规使用数据集中包含的相关内容。
提供机构:
maas
创建时间:
2023-11-27



