ChatAlpaca

Source: GitHub. Updated 2023-06-01. Indexed 2024-05-31.

Download link:

https://github.com/cascip/ChatAlpaca

Overview:

ChatAlpaca is a chat dataset designed to help researchers develop instruction-following multi-turn dialogue models. It extends the [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) dataset with multi-turn instructions and their corresponding responses.

Created:

2023-04-06

Summary of the original information

ChatAlpaca: A Multi-Turn Dialogue Corpus based on Alpaca Instructions

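Dataset overview

ChatAlpaca is a multi-turn dialogue dataset intended to help researchers develop instruction-following models for multi-turn conversation. It extends the Stanford Alpaca data and contains multi-turn instructions with their corresponding responses.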

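Data content

The dataset contains one JSON file, data/chatalpaca-10k.json, which holds multi-turn instruction-following conversations. Each entry represents a different conversation between a simulated user and ChatGPT. The dataset currently contains 10,000 conversations and 95,558 utterances in total.

Update: we have released data/chatalpaca-20k.json, together with a Chinese version translated with Google Translate, data/chatalpaca-20k-chinese-google-translation.json.

Each conversation begins with an instruction followed by a ChatGPT response, both taken from the Stanford Alpaca data. The conversation then continues with the user giving further instructions and ChatGPT responding to them.

Each entry is a list of dictionaries, and each dictionary contains the following fields:

role: str, the speaker's role in the conversation, either "user" or "assistant".
content: str, the content of the utterance.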

Example data:

```json
[
  {
    "role": "user",
    "content": "Identify the odd one out: Twitter, Instagram, Telegram"
  },
  {
    "role": "assistant",
    "content": "Telegram"
  },
  {
    "role": "user",
    "content": "What makes Telegram different from Twitter and Instagram?"
  },
  {
    "role": "assistant",
    "content": "Telegram is a cloud-based instant messaging app that focuses on privacy and security. Unlike Twitter and Instagram which are mainly used for following news, celebrities, and sharing images, Telegram was created as a secure messaging app for private and group communication. Telegram also offers more advanced features than Twitter and Instagram, such as the ability to schedule messages, create bots, and send encrypted messages."
  },
  {
    "role": "user",
    "content": "Can you give me an example of how the scheduling messages feature can be useful on Telegram?"
  },
  {
    "role": "assistant",
    "content": "Sure! The scheduling messages feature on Telegram can be useful in many different situations. Here's an example:\n\nLet's say you have a workgroup on Telegram, and you want to send a message to everyone in the group to remind them of a meeting. However, you know that most of the people in the group are busy at work during the time you want to send the message.\n\nThat's where the scheduling messages feature comes in. You can schedule the message to be sent at a later time when people are more likely to be free, such as during their lunch break or after work. This way, everyone in the group will receive the message at the same time, and you won't have to worry about anyone missing the important reminder.\n\nOverall, the scheduling messages feature can be really helpful for coordinating communication with groups of people on Telegram, whether it's for work, social events, or anything else."
  },
  {
    "role": "user",
    "content": "Goodbye."
  }
]
```
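The entry format described above can be checked programmatically before training. The following is a minimal sketch, assuming the file is a single JSON array whose elements are conversations (lists of role/content dictionaries); the function names `load_dataset`, `validate_conversation`, and `summarize` are illustrative and not part of the ChatAlpaca repository.

```python
import json

# The two roles the dataset description allows.
VALID_ROLES = {"user", "assistant"}


def load_dataset(path):
    """Load a dataset file.

    Assumption: the file is one JSON array whose elements are conversations.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def validate_conversation(conversation):
    """Return True if every turn is a dict with a valid role and str content."""
    if not isinstance(conversation, list):
        return False
    for turn in conversation:
        if not isinstance(turn, dict):
            return False
        if turn.get("role") not in VALID_ROLES:
            return False
        if not isinstance(turn.get("content"), str):
            return False
    return True


def summarize(dataset):
    """Count conversations and utterances in a list of conversations."""
    return {
        "conversations": len(dataset),
        "utterances": sum(len(conv) for conv in dataset),
    }


# A shortened in-memory sample in the same shape as the example entry.
sample = [
    [
        {"role": "user", "content": "Identify the odd one out: Twitter, Instagram, Telegram"},
        {"role": "assistant", "content": "Telegram"},
        {"role": "user", "content": "Goodbye."},
    ]
]

print(all(validate_conversation(conv) for conv in sample))
print(summarize(sample))
```

To run the same check on the released data, `load_dataset("data/chatalpaca-10k.json")` could replace the in-memory sample, provided the top-level layout matches the assumption stated above.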

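Data generation process

The data generation process starts from the original Stanford Alpaca data. We use the instructions in that dataset as the starting points of the conversations.

User utterance generation

We use ChatGPT (GPT-3.5-turbo) to generate follow-up utterances and continue the conversation. To do this, we give the model a designed prompt together with the current chat history and ask ChatGPT to generate a simulated user utterance that continues the conversation, either by asking a further question or by providing a hint.

Response generation

We then use the generated user utterance as the starting point of the next round of the conversation. ChatGPT's response is appended to the chat history.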
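The two alternating steps above can be sketched as a simple loop. This is an illustration only, not the authors' script: `simulate_user` and `respond` are hypothetical callables standing in for the two ChatGPT calls, and the `max_rounds` cap is an assumption added so the loop terminates.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]


def extend_dialogue(
    seed: List[Turn],
    simulate_user: Callable[[List[Turn]], str],
    respond: Callable[[List[Turn]], str],
    max_rounds: int = 3,
) -> List[Turn]:
    """Grow a seed dialogue by alternating simulated-user and assistant turns."""
    dialogue = list(seed)  # copy so the seed is left untouched
    for _ in range(max_rounds):
        # Step 1: generate the next user utterance from the chat history.
        user_text = simulate_user(dialogue)
        dialogue.append({"role": "user", "content": user_text})
        # Step 2: generate the response and append it to the history.
        assistant_text = respond(dialogue)
        dialogue.append({"role": "assistant", "content": assistant_text})
    return dialogue


# Stand-in generators so the sketch runs without any API access.
def fake_user(history: List[Turn]) -> str:
    return f"Follow-up question {len(history) // 2}"


def fake_assistant(history: List[Turn]) -> str:
    return f"Answer to: {history[-1]['content']}"


seed = [
    {"role": "user", "content": "Identify the odd one out: Twitter, Instagram, Telegram"},
    {"role": "assistant", "content": "Telegram"},
]
result = extend_dialogue(seed, fake_user, fake_assistant, max_rounds=2)
for turn in result:
    print(turn["role"], "->", turn["content"])
```

In a real pipeline the two stand-in functions would wrap model calls; the loop structure itself does not depend on which model is used.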

Model weights

We have released the Chatalpaca-7b-lora and Chatalpaca-7B-hf model weights. Both models were trained on the 20k data.

Fine-tuning

To fine-tune LLaMA on your own data, follow the format of ./data/chatalpaca-20k.json and run the fine-tuning script.
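As an illustration of preparing your own data in this format, the sketch below converts ordered (instruction, response) pairs into one conversation of role/content dictionaries. It assumes the target file is a JSON array of such conversations; `to_chat_format` and `dump_dataset` are hypothetical helper names, not functions from the repository.

```python
import json


def to_chat_format(exchanges):
    """Turn ordered (instruction, response) pairs into one conversation."""
    conversation = []
    for instruction, response in exchanges:
        conversation.append({"role": "user", "content": instruction})
        conversation.append({"role": "assistant", "content": response})
    return conversation


def dump_dataset(conversations, path):
    """Write a list of conversations as one JSON array (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        json.dump(conversations, f, ensure_ascii=False, indent=2)


exchanges = [
    ("Name a prime number.", "2"),
    ("Why is it prime?", "Its only divisors are 1 and itself."),
]
conversation = to_chat_format(exchanges)
print(json.dumps(conversation, indent=2))
```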

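Code and hyperparameters

Our code is based on Fastchat and alpaca-lora and supports multi-turn chat data. We use hyperparameters similar to those of Fastchat and alpaca-lora.

Full fine-tuning

| Model | Global batch size | Learning rate | Epochs | Max length | Weight decay |
| --- | --- | --- | --- | --- | --- |
| Chatalpaca-7b-hf | 128 | 2e-5 | 6 | 2048 | 0 |

LoRA fine-tuning

| Model | Global batch size | Learning rate | Epochs | Max length | Lora_target_modules | Lora_r |
| --- | --- | --- | --- | --- | --- | --- |
| Chatalpaca-7b-lora | 64 | 3e-4 | 3 | 2048 | [q_proj,k_proj,v_proj,o_proj] | 16 |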
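The tables give global batch sizes only. How a global batch is split across devices depends on the hardware, which the source does not state. As a worked example under assumed values, the sketch below applies the usual relation global batch = per-device batch × number of devices × gradient accumulation steps; the device count and per-device batch size are hypothetical.

```python
# Hyperparameters copied from the two tables above.
FULL_FT = {
    "global_batch_size": 128,
    "learning_rate": 2e-5,
    "epochs": 6,
    "max_length": 2048,
    "weight_decay": 0.0,
}
LORA_FT = {
    "global_batch_size": 64,
    "learning_rate": 3e-4,
    "epochs": 3,
    "max_length": 2048,
    "lora_target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_r": 16,
}


def gradient_accumulation_steps(global_batch_size, per_device_batch_size, num_devices):
    """Solve global = per_device * num_devices * steps for steps."""
    per_step = per_device_batch_size * num_devices
    if per_step <= 0 or global_batch_size % per_step != 0:
        raise ValueError("global batch size must be a positive multiple of "
                         "per_device_batch_size * num_devices")
    return global_batch_size // per_step


# Hypothetical hardware: 4 devices, 4 sequences per device per step.
print(gradient_accumulation_steps(FULL_FT["global_batch_size"], 4, 4))
print(gradient_accumulation_steps(LORA_FT["global_batch_size"], 4, 4))
```

With these assumed values, full fine-tuning would need 8 accumulation steps (128 / 16) and LoRA fine-tuning 4 (64 / 16).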

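Deployment and inference

Use the following command to deploy the Chatalpaca-20k model to a local server.

```bash
python utils/deploy.py --model_path /path/to/your-model --prompt_style vicuna-hf
```

After deployment, you can open localhost:7860 to chat with the model.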
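Before opening the page in a browser, it can help to confirm that something is listening on the port. The sketch below is a generic TCP reachability check using only the standard library; it makes no assumption about the server's API, and `is_port_open` is an illustrative helper name.

```python
import socket


def is_port_open(host="localhost", port=7860, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if is_port_open():
    print("Server is reachable at localhost:7860")
else:
    print("Nothing is listening on localhost:7860 yet")
```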

TODO

[x] Release the 10k data
[x] Release the 20k data
[x] Release the Chinese translated version of the data
[x] Release the LLaMA-7B-LoRA model
[x] Release the fine-tuned LLaMA-7B model

Citation

Please cite this repository if you use the dataset.

    @misc{ChatAlpaca,
      author = {Ning Bian and Hongyu Lin and Yaojie Lu and Xianpei Han and Le Sun and Ben He},
      title = {ChatAlpaca: A Multi-Turn Dialogue Corpus based on Alpaca Instructions},
      year = {2023},
      publisher = {GitHub},
      journal = {GitHub repository},
      howpublished = {\url{https://github.com/cascip/ChatAlpaca}},
    }

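Collected summary

Dataset introduction

Construction method

ChatAlpaca is built on the Stanford Alpaca dataset, using ChatGPT (GPT-3.5-turbo) to generate multi-turn conversations. Each conversation starts from an instruction and response in Stanford Alpaca. A designed prompt and the current chat history are then used to guide ChatGPT to produce the simulated user's follow-up utterances. To keep the conversations natural, utterances that sound like an AI assistant are filtered out during generation. Generation proceeds over multiple rounds until a preset number of turns is reached or the user signals the end of the conversation. The 10k release contains 10,000 conversations with 95,558 utterances in total.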
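The filtering and stopping rules described above could look roughly like the following. This is a sketch under stated assumptions: the source does not list the actual phrases or thresholds, so the marker strings in `ASSISTANT_LIKE_MARKERS` and `END_MARKERS`, the `max_utterances` value, and all function names are hypothetical.

```python
# Hypothetical marker phrases; the real filter rules are not given in the source.
ASSISTANT_LIKE_MARKERS = ("as an ai", "how can i help", "how may i assist")
END_MARKERS = ("goodbye", "bye")


def looks_like_assistant(utterance):
    """Flag a simulated user utterance that reads like an AI assistant."""
    lowered = utterance.lower()
    return any(marker in lowered for marker in ASSISTANT_LIKE_MARKERS)


def signals_end(utterance):
    """Return True if the whole utterance is a farewell such as 'Goodbye.'"""
    text = utterance.strip().lower().rstrip(".!")
    return text in END_MARKERS


def should_stop(dialogue, max_utterances=10):
    """Stop at the length cap or when the last user turn is a farewell."""
    if len(dialogue) >= max_utterances:
        return True
    if not dialogue:
        return False
    last = dialogue[-1]
    return last["role"] == "user" and signals_end(last["content"])


print(looks_like_assistant("As an AI language model, I cannot do that."))
print(signals_end("Goodbye."))
print(should_stop([{"role": "user", "content": "Goodbye."}]))
```

A flagged utterance would typically be regenerated rather than kept, although the source does not say how rejected utterances were handled.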

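Usage

The dataset can be used for model fine-tuning and for inference deployment. The provided scripts fine-tune LLaMA-7B on the dataset, with support for both full-parameter fine-tuning and LoRA fine-tuning. For fine-tuning, the user specifies the data path, the model path, and hyperparameters such as the learning rate and batch size. For inference, a deployment script loads the fine-tuned model onto a local server, where it can be tested through a web interface. The usage documentation describes each step of the workflow so that researchers can get started quickly.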

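Background and challenges

Background

ChatAlpaca was developed by the Chinese Information Processing Laboratory at the Institute of Software, Chinese Academy of Sciences, to support instruction following in multi-turn dialogue. It extends Stanford's Alpaca dataset, using ChatGPT (GPT-3.5-turbo) to generate multi-turn conversations that simulate interaction between a user and an AI assistant. ChatAlpaca builds on Alpaca's strengths in instruction generation while adding more varied and complex conversations, which makes it a useful data resource for natural language processing research on multi-turn dialogue systems and instruction-following models.

Current challenges

Building ChatAlpaca involved several challenges. First, high-quality multi-turn conversations must be coherent and logical, which places high demands on the generation model's understanding of context. Second, the generated content has to avoid repetition and redundancy while remaining diverse and deep. Third, user instructions and AI responses need to be semantically aligned so that the generated conversations accurately reflect real interaction patterns. Finally, extending and translating the dataset brings further technical difficulties, particularly in preserving semantic consistency and cultural appropriateness.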

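Common scenarios

Typical use cases

ChatAlpaca is used in the research and development of multi-turn dialogue models. By simulating multi-turn conversations between a user and ChatGPT, it provides a wide range of instruction-following dialogue scenarios and is particularly suited to training and evaluating how dialogue systems respond to complex instructions. Researchers can use it to improve a model's coherence and context understanding across turns.

Academic problems addressed

ChatAlpaca targets two key problems in multi-turn dialogue systems: instruction understanding and contextual coherence. Its large volume of multi-turn conversations supports research on how models can better understand user intent and keep a conversation logically consistent, and helps models respond more accurately to complex instructions.

Practical applications

In practice, ChatAlpaca provides training data for dialogue systems such as intelligent customer service and virtual assistants. Models trained on it are better able to handle a user's multi-turn instructions, which improves the usefulness of the system and the user experience. In a customer service setting, for example, the model can understand the user's needs more accurately and offer a coherent solution, reducing the need for human intervention.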

Recent research on the dataset