togethercomputer/llama-instruct

Name: togethercomputer/llama-instruct
Creator: togethercomputer
Published: 2023-08-18 05:04:06
License: llama2

Hugging Face | Updated 2023-08-18 | Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/togethercomputer/llama-instruct


Resource description:

---
license: llama2
language:
- en
---

# llama-instruct

This dataset was used to finetune [Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct).

We follow the distillation paradigm used by [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), [WizardLM](https://arxiv.org/abs/2304.12244), and [Orca](https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/): producing instructions by querying a powerful LLM, which in our case is the [Llama-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model released by [Meta](https://ai.meta.com/llama/).

To build [Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct), we collect instructions from 19K human inputs extracted from [ShareGPT-90K](https://huggingface.co/datasets/philschmid/sharegpt-raw) (only using human inputs, not ChatGPT outputs).

The actual script handles multi-turn conversations and also supports restarting and caching via a SQLite3 database. You can find the full script [here](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct/blob/main/scripts/distill.py), with merely 122 lines!

The output of this step is a jsonl file, each line corresponding to one conversation:

```
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
```

For more details, please refer to the [GitHub repo](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct).

## Languages

The language of the data is entirely English.

Provided by:

togethercomputer

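Summary of original information

llama-instruct dataset overview

Dataset purpose

This dataset was used to fine-tune the Llama-2-7B-32K-Instruct model.

Dataset construction method

The dataset follows the distillation paradigm used by projects such as Alpaca, Vicuna, WizardLM, and Orca. Instructions are produced by querying a powerful LLM, in this case the Llama-2-70B-Chat model released by Meta.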
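The distillation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: `distill_conversation` and `query_model` are hypothetical names, `query_model` stands in for whatever call reaches the teacher model, and the exact spacing around the `[INST]` markers is an assumption based on the output format shown in the dataset card.

```python
from typing import Callable, List


def distill_conversation(
    human_inputs: List[str],
    query_model: Callable[[str], str],
) -> str:
    """Build one training string by answering each human turn with a teacher.

    `query_model` is a hypothetical stand-in for a call to the teacher model
    (Llama-2-70B-Chat in the dataset card). It receives the conversation so
    far and returns the generated answer.
    """
    text = ""
    for instruction in human_inputs:
        # Append the next human turn, wrapped in the [INST] markers.
        text += f"[INST] {instruction.strip()} [/INST]"
        # Ask the teacher to continue from the whole conversation so far.
        answer = query_model(text)
        text += f" {answer.strip()} "
    return text.strip()


def echo_teacher(prompt: str) -> str:
    """Toy teacher used only to show the shape of the output."""
    return f"(answer #{prompt.count('[/INST]')})"


example = distill_conversation(["Hi", "Tell me more"], echo_teacher)
print(example)
```

Passing the full running text to the teacher on every turn is one simple way to keep multi-turn context; a real pipeline would also need to respect the teacher's context length.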

Data source

The dataset draws on 19K human inputs extracted from ShareGPT-90K (only the human inputs are used, not the ChatGPT outputs).
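Keeping only the human side of each conversation might look like the sketch below. The record layout (`conversations`, `from`, `value`, with `"human"` marking user turns) is an assumption based on how ShareGPT-style dumps are commonly structured; it is not confirmed by this page, and `human_turns` is a hypothetical helper name.

```python
from typing import Dict, List


def human_turns(conversation: Dict) -> List[str]:
    """Return only the human-authored messages of one conversation record.

    Assumes a ShareGPT-style layout: a "conversations" list whose items
    carry a "from" role and a "value" text field.
    """
    return [
        turn["value"]
        for turn in conversation.get("conversations", [])
        if turn.get("from") == "human" and turn.get("value", "").strip()
    ]


# Hand-written record used only for illustration.
record = {
    "id": "demo",
    "conversations": [
        {"from": "human", "value": "What is a JSONL file?"},
        {"from": "gpt", "value": "A file with one JSON object per line."},
        {"from": "human", "value": "Show me an example."},
    ],
}
inputs = human_turns(record)
print(inputs)
```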

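Data processing

The actual script handles multi-turn conversations and supports restarting and caching via a SQLite3 database. The full script is available at https://github.com/togethercomputer/Llama-2-7B-32K-Instruct/blob/main/scripts/distill.py and is only 122 lines long.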
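One way restart-safe caching with SQLite could work is sketched below, using Python's standard `sqlite3` module. The table name, schema, and function names are hypothetical and are not taken from the authors' script; the idea is only that a prompt already answered in a previous run is read back from the database instead of being sent to the model again.

```python
import sqlite3
from typing import Callable, List


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database; use a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, answer TEXT)"
    )
    return conn


def cached_query(
    conn: sqlite3.Connection,
    prompt: str,
    query_model: Callable[[str], str],
) -> str:
    """Return the stored answer for `prompt`, calling the model only on a miss."""
    row = conn.execute(
        "SELECT answer FROM cache WHERE prompt = ?", (prompt,)
    ).fetchone()
    if row is not None:
        return row[0]
    answer = query_model(prompt)
    conn.execute(
        "INSERT OR REPLACE INTO cache (prompt, answer) VALUES (?, ?)",
        (prompt, answer),
    )
    conn.commit()  # Persist immediately so an interrupted run loses nothing.
    return answer


# Toy model that records how often it is actually called.
calls: List[str] = []


def fake_model(prompt: str) -> str:
    calls.append(prompt)
    return prompt.upper()


conn = open_cache()
first = cached_query(conn, "hello", fake_model)
second = cached_query(conn, "hello", fake_model)
```

Committing after every insert trades some throughput for the ability to resume from the last completed request.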

Data format

The output is a JSONL file in which each line corresponds to one conversation:

{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
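To make the format concrete, the sketch below splits one such line back into (instruction, answer) pairs. The helper name `parse_line` and the sample text are illustrative assumptions. The regular expression assumes the markers appear exactly as `[INST]` and `[/INST]`, and it would mis-split an answer that itself contains a literal `[INST]`.

```python
import json
import re
from typing import List, Tuple

# One turn: an instruction between the markers, then the answer up to the
# next [INST] marker or the end of the text.
TURN = re.compile(r"\[INST\](.*?)\[/INST\](.*?)(?=\[INST\]|$)", re.DOTALL)


def parse_line(line: str) -> List[Tuple[str, str]]:
    """Decode one JSONL line and return its (instruction, answer) pairs."""
    text = json.loads(line)["text"]
    return [(q.strip(), a.strip()) for q, a in TURN.findall(text)]


# Hand-written sample in the documented layout, not real dataset content.
sample = json.dumps(
    {"text": "[INST] Hi [/INST] Hello! [INST] Bye [/INST] Goodbye."}
)
turns = parse_line(sample)
print(turns)
```

Reading a whole file is then a matter of calling `parse_line` on each non-empty line.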

Language

The dataset is entirely in English.
