Customer Service QA Pairs for LLM Conversational Training
收藏Databricks2024-05-09 收录
下载链接:
https://marketplace.databricks.com/details/2ef1f9e3-d95d-491a-9251-dea46285b410/Bitext-Innovation-International_Customer-Service-QA-Pairs-for-LLM-Conversational-Training
下载链接
链接失效反馈官方服务:
资源简介:
**Overview**
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM Fine-Tuning. For example, if you are [ACME Company], you can create your own customized LLM by first training a fine-tuned model using this dataset, and then further fine-tuning it with a small amount of your own data. An overview of this approach can be found at: [From General-Purpose LLMs to Verticalized Enterprise Models](https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/)
**Use cases**
- Training sophisticated AI models for intent detection and response generation in customer service applications.
- Enhancing domain adaptation and fine-tuning of Large Language Models such as DBRX, GPT, Mistral, Llama3, Falcon, etc. with a diverse range of intents and contextual customer queries.
**Product details**
This rich dataset encompasses:
- **Categories:** Spanning across ACCOUNT, CANCELLATION_FEE, CONTACT, DELIVERY, and more, facilitating nuanced model training.
- **Intents:** Features 27 distinct intents like create_account, check_cancellation_fee, and track_order, among others.
- **Questions/Answers:** A collection of 26,872 pairs with an average of 1,000 per intent.
- **Entities/Slots:** Includes 30 types such as {{Order Number}}, {{Invoice Number}}, and {{Customer Support Email}}.
- **Language Generation Tags:** Contains 12 types of tags for morphological, semantic, syntactic, and register variations, aiding in the creation of dialogues that mimic real-life conversational patterns.
For an immersive look into the dataset structure and content, refer to the embedded notebook which showcases sample queries and responses along with detailed instructions for use.
**Additional Insights**
- The dataset has been meticulously curated by computational linguists, ensuring quality and relevance.
- Comprehensive tagging allows for dataset customization based on linguistic phenomena, catering to various user profiles and conversational styles.
- Visit [Bitext's Vertical-Specific Datasets](https://www.bitext.com/chatbot-verticals/) for an in-depth understanding of our vertical coverage and intents.
## 数据集概览
本混合合成数据集专为GPT、Mistral及OpenELM等大语言模型(Large Language Model,LLM)的微调任务设计,通过我们的自然语言处理(Natural Language Processing,NLP)/自然语言生成(Natural Language Generation,NLG)技术与自动化数据标注(Data Labeling,DAL)工具生成。其核心目标是展示如何借助我们提出的大语言模型微调两步法,轻松实现客户支持领域的大语言模型垂直化/领域适配。例如,若您隶属于[ACME公司],可先通过本数据集训练微调模型,再使用少量自有数据对其进行进一步微调,从而打造专属定制化大语言模型。该方法的详细概述可参阅:[From General-Purpose LLMs to Verticalized Enterprise Models](https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/)
## 应用场景
- 训练适用于客服场景的意图识别与回复生成的高精度AI模型
- 针对DBRX、GPT、Mistral、Llama3、Falcon等大语言模型,通过多样化意图与上下文客户查询实现领域适配与微调优化
## 产品详情
本丰富数据集包含以下内容:
- **类别**:覆盖账户(ACCOUNT)、取消费用(CANCELLATION_FEE)、联系方式(CONTACT)、配送(DELIVERY)等多个类别,便于开展精细化模型训练
- **意图**:包含27种不同意图,例如创建账户(create_account)、查询取消费用(check_cancellation_fee)、追踪订单(track_order)等
- **问答对**:共计26872组,平均每种意图对应约1000组问答
- **实体/槽位**:涵盖30类实体,例如{{订单编号}}、{{发票编号}}、{{客户支持邮箱}}
- **语言生成标签**:包含12类标签,用于覆盖形态、语义、句法及语域变体,助力生成贴合真实对话模式的交互文本
如需深入了解数据集结构与内容,请参阅内嵌的演示笔记,其中包含示例查询与回复及详细使用说明。
## 额外洞察
- 本数据集由计算语言学家精心编撰,确保内容质量与领域相关性
- 全面的标签体系支持基于语言现象的数据集定制,可适配不同用户画像与对话风格
- 访问[Bitext垂直领域专属数据集](https://www.bitext.com/chatbot-verticals/)可深入了解我们的垂直领域覆盖范围与意图类型
提供机构:
Bitext Innovation International
搜集汇总
数据集介绍

背景与挑战
背景概述
该数据集是一个混合合成的客户服务问答对集合,专用于微调GPT、Mistral等大语言模型,以提升其在客户支持领域的垂直化适应能力。它包含27种意图、近2.7万个问答对、30种实体类型和12种语言生成标签,支持模型在意图检测和响应生成方面的精细化训练。
以上内容由遇见数据集搜集并总结生成



