
togethercomputer/llama-instruct

Hugging Face · Updated 2023-08-18 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/togethercomputer/llama-instruct
Resource description:
---
license: llama2
language:
- en
---

# llama-instruct

This dataset was used to finetune [Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct).
We follow the distillation paradigm used by [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), [Vicuna](https://lmsys.org/blog/2023-03-30-vicuna/), [WizardLM](https://arxiv.org/abs/2304.12244), and [Orca](https://www.microsoft.com/en-us/research/publication/orca-progressive-learning-from-complex-explanation-traces-of-gpt-4/): producing instructions by querying a powerful LLM, which in our case is the [Llama-2-70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) model released by [Meta](https://ai.meta.com/llama/).

To build [Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct), we collect instructions from 19K human inputs extracted from [ShareGPT-90K](https://huggingface.co/datasets/philschmid/sharegpt-raw) (only using human inputs, not ChatGPT outputs).
The actual script handles multi-turn conversations and also supports restarting and caching via a SQLite3 database.
You can find the full script [here](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct/blob/main/scripts/distill.py), with merely 122 lines!

The output of this step is a jsonl file, each line corresponding to one conversation:
```
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
```

For more details, please refer to the [GitHub repo](https://github.com/togethercomputer/Llama-2-7B-32K-Instruct).

## Languages

The language of the data is entirely English.
Provider:
togethercomputer
Summary of the original information

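llama-instruct dataset overview

Purpose

The dataset was used to finetune the Llama-2-7B-32K-Instruct model.

Construction method

The dataset follows the distillation paradigm used by projects such as Alpaca, Vicuna, WizardLM, and Orca: instructions are produced by querying a powerful LLM, in this case the Llama-2-70B-Chat model released by Meta.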

Data source

The dataset collects 19K human inputs extracted from ShareGPT-90K (only the human inputs are used, not the ChatGPT outputs).
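The "human inputs only" filter can be illustrated with a minimal sketch. It assumes each ShareGPT record carries a `conversations` list whose items have `from` and `value` keys, with `from == "human"` marking user turns; that layout, the helper name `human_turns`, and the sample record are assumptions for illustration, not taken from the authors' script.

```python
from typing import Dict, List


def human_turns(record: Dict) -> List[str]:
    """Return the human-authored messages of one conversation, in order.

    Assumes a ShareGPT-style record: {"conversations": [{"from": ..., "value": ...}]}.
    """
    return [
        turn["value"].strip()
        for turn in record.get("conversations", [])
        # Keep only non-empty turns written by the human side.
        if turn.get("from") == "human" and turn.get("value", "").strip()
    ]


# Hypothetical sample record, made up for this example.
sample = {
    "id": "demo-0",
    "conversations": [
        {"from": "human", "value": "What is a context window?"},
        {"from": "gpt", "value": "(ChatGPT output, discarded)"},
        {"from": "human", "value": "Why does 32K matter?"},
    ],
}

print(human_turns(sample))
```

Only the two human turns survive; the model then supplies fresh answers for them during distillation.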

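Data processing

The actual script handles multi-turn conversations and supports restarting and caching through a SQLite3 database. The full script, only 122 lines long, is available at https://github.com/togethercomputer/Llama-2-7B-32K-Instruct/blob/main/scripts/distill.py.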
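How a SQLite3 cache makes such a job restartable can be sketched as follows. This is not the authors' `distill.py`: the table layout, the function names `open_cache` and `cached_answer`, and the stand-in `fake_model` are all hypothetical. The idea is simply that each prompt is looked up before the model is queried, so a rerun skips work that already finished.

```python
import sqlite3
from typing import Callable, List


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database. Use a file path to survive restarts."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "prompt TEXT PRIMARY KEY, answer TEXT NOT NULL)"
    )
    return conn


def cached_answer(
    conn: sqlite3.Connection, prompt: str, generate: Callable[[str], str]
) -> str:
    """Return the stored answer for `prompt`, calling `generate` only on a miss."""
    row = conn.execute(
        "SELECT answer FROM cache WHERE prompt = ?", (prompt,)
    ).fetchone()
    if row is not None:
        return row[0]
    answer = generate(prompt)
    conn.execute(
        "INSERT INTO cache (prompt, answer) VALUES (?, ?)", (prompt, answer)
    )
    conn.commit()  # Persist immediately so an interrupted run loses nothing.
    return answer


calls: List[str] = []


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for the real LLM query."""
    calls.append(prompt)
    return f"answer to: {prompt}"


conn = open_cache()
first = cached_answer(conn, "[INST] hi [/INST]", fake_model)
second = cached_answer(conn, "[INST] hi [/INST]", fake_model)
print(first == second, len(calls))
```

The second call is served from the database, so the model is queried only once for the same prompt.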

Data format

The output is a jsonl file in which each line corresponds to one conversation:
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
{"text": "[INST] ... instruction ... [/INST] ... answer ... [INST] ... instruction ... [/INST] ..."}
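As an illustration of this format, the sketch below splits one record's `text` field back into (instruction, answer) pairs. The regular expression and the helper name `split_turns` are assumptions made for this example, and the sample line is invented; the approach would mis-split an answer that itself contains a literal `[INST]` tag.

```python
import json
import re
from typing import List, Tuple

# One turn: "[INST] instruction [/INST] answer", ending at the next "[INST]" or end of text.
TURN = re.compile(r"\[INST\](.*?)\[/INST\](.*?)(?=\[INST\]|\Z)", re.DOTALL)


def split_turns(text: str) -> List[Tuple[str, str]]:
    """Split a conversation string into (instruction, answer) pairs."""
    return [(q.strip(), a.strip()) for q, a in TURN.findall(text)]


# Invented sample line in the jsonl layout described above.
line = json.dumps(
    {"text": "[INST] Say hi. [/INST] Hi! [INST] Say bye. [/INST] Bye!"}
)

record = json.loads(line)
pairs = split_turns(record["text"])
print(pairs)
```

For a real file, the same function would be applied to each line after `json.loads`.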

Languages

The data is entirely in English.
