togethercomputer/llama-instruct
Source: Hugging Face · Updated 2023-08-18 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/togethercomputer/llama-instruct
Resource description:
---
license: llama2
language:
- en
---
# llama-instruct
This dataset was used to finetune [Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct).
We follow the distillation paradigm used by [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), [WizardLM](https://arxiv.org/abs/2304.12244), and [Orca](https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/): producing instructions by querying a powerful LLM, which in our case is the [Llama-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) model released by [Meta](https://ai.meta.com/llama/).
To build [Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct), we collect instructions from 19K human inputs extracted from [ShareGPT-90K](https://huggingface.co/datasets/philschmid/sharegpt-raw) (only using human inputs, not ChatGPT outputs).
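As a rough illustration of this step, the sketch below keeps only the human turns of one conversation and replays them against a teacher model. It is not the authors' script: the record layout (`conversations`, `from`, `value`) is an assumption about the ShareGPT export, `generate` is a hypothetical stand-in for whatever client queries Llama-2-70B-Chat, and the exact spacing around the `[INST]` tags is a guess.

```python
from typing import Callable, Dict, List


def human_turns(record: Dict) -> List[str]:
    """Return only the human-authored messages of one ShareGPT-style record.

    Assumes each record looks like
    {"conversations": [{"from": "human" | "gpt", "value": "..."}]}.
    """
    return [
        turn["value"].strip()
        for turn in record.get("conversations", [])
        if turn.get("from") == "human" and (turn.get("value") or "").strip()
    ]


def distill(record: Dict, generate: Callable[[str], str]) -> str:
    """Replay the human turns against a teacher model and return the transcript.

    `generate` is a hypothetical callable: it receives the conversation so far
    and returns the teacher's next answer. ChatGPT outputs are never used.
    """
    text = ""
    for instruction in human_turns(record):
        text += f"[INST] {instruction} [/INST]"
        answer = generate(text)  # the teacher sees the whole conversation so far
        text += f" {answer.strip()} "
    return text.strip()
```

Passing the model call in as a plain function keeps the sketch independent of any particular inference API.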
The distillation script handles multi-turn conversations and supports restarting and caching via a SQLite3 database.
You can find the full script [here](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct/blob/main/scripts/distill.py); it is only 122 lines long.
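To illustrate how a SQLite3 cache can make such a job restartable, here is a minimal sketch. The table name, the hashing scheme, and the `CompletionCache` class are all hypothetical and are not taken from the linked script; `generate` again stands in for the call to the teacher model.

```python
import hashlib
import sqlite3
from typing import Callable


class CompletionCache:
    """Store teacher answers keyed by a hash of the prompt (hypothetical schema)."""

    def __init__(self, path: str = ":memory:") -> None:
        # Point `path` at a file on disk so answers survive a restart.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS completions ("
            "key TEXT PRIMARY KEY, answer TEXT NOT NULL)"
        )
        self.conn.commit()

    def get_or_compute(self, prompt: str, generate: Callable[[str], str]) -> str:
        """Return the cached answer, calling `generate` only on a cache miss."""
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        row = self.conn.execute(
            "SELECT answer FROM completions WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]
        answer = generate(prompt)
        self.conn.execute(
            "INSERT OR REPLACE INTO completions (key, answer) VALUES (?, ?)",
            (key, answer),
        )
        self.conn.commit()  # commit per answer so an interrupted run loses little
        return answer
```

With a file-backed database, rerunning the job after a crash would skip every prompt that was already answered and only query the model for the remainder.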
The output of this step is a JSONL file in which each line corresponds to one conversation:
```
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
```
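A file in this format can be read back into (instruction, answer) pairs with a few lines of Python. The sketch below assumes only what the example above shows: a `text` field whose turns are delimited by `[INST]` and `[/INST]`. The helper names are hypothetical.

```python
import json
import re
from typing import Iterable, Iterator, List, Tuple

# One turn: an instruction between [INST] and [/INST], followed by the answer,
# which runs until the next [INST] or the end of the string.
TURN = re.compile(r"\[INST\](.*?)\[/INST\](.*?)(?=\[INST\]|\Z)", re.DOTALL)


def split_turns(text: str) -> List[Tuple[str, str]]:
    """Split one conversation string into (instruction, answer) pairs."""
    return [(q.strip(), a.strip()) for q, a in TURN.findall(text)]


def read_conversations(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Parse JSONL lines, yielding one list of turns per conversation."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield split_turns(json.loads(line)["text"])
```

Because `read_conversations` accepts any iterable of strings, it can be given an open file handle directly, for example `read_conversations(open(path, encoding="utf-8"))` with `path` pointing at the JSONL file.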
For more details, please refer to the [GitHub repo](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct).
## Languages
The language of the data is entirely English.
Provider:
togethercomputer
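Summary of original information
llama-instruct dataset overview
Purpose
This dataset was used to fine-tune the Llama-2-7B-32K-Instruct model.
Construction method
The dataset follows the distillation paradigm used by projects such as Alpaca, Vicuna, WizardLM, and Orca. Instructions are produced by querying a powerful LLM, in this case the Llama-2-70B-Chat model released by Meta.
Data source
The dataset is built from 19K human inputs extracted from ShareGPT-90K (only the human inputs are used, not the ChatGPT outputs).
Data processing
The distillation script handles multi-turn conversations and supports restarting and caching via a SQLite3 database. The full script is available [here](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct/blob/main/scripts/distill.py) and is only 122 lines long.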
Data format
The output is a JSONL file in which each line corresponds to one conversation:
```json
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
```
Language
The data is entirely in English.



