Customer Service QA Pairs for LLM Conversational Training

Name: Customer Service QA Pairs for LLM Conversational Training
Creator: Bitext Innovation International
License: No description provided

Source: Databricks (indexed 2024-05-09)

Download link:

https://marketplace.databricks.com/details/2ef1f9e3-d95d-491a-9251-dea46285b410/Bitext-Innovation-International_Customer-Service-QA-Pairs-for-LLM-Conversational-Training

Description:

**Overview**

This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM Fine-Tuning. For example, if you are [ACME Company], you can create your own customized LLM by first training a fine-tuned model using this dataset, and then further fine-tuning it with a small amount of your own data. An overview of this approach can be found at: [From General-Purpose LLMs to Verticalized Enterprise Models](https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/)

**Use cases**

- Training sophisticated AI models for intent detection and response generation in customer service applications.
- Enhancing domain adaptation and fine-tuning of Large Language Models such as DBRX, GPT, Mistral, Llama3, Falcon, etc. with a diverse range of intents and contextual customer queries.

**Product details**

This rich dataset encompasses:

- **Categories:** Spanning across ACCOUNT, CANCELLATION_FEE, CONTACT, DELIVERY, and more, facilitating nuanced model training.
- **Intents:** Features 27 distinct intents like create_account, check_cancellation_fee, and track_order, among others.
- **Questions/Answers:** A collection of 26,872 pairs with an average of 1,000 per intent.
- **Entities/Slots:** Includes 30 types such as {{Order Number}}, {{Invoice Number}}, and {{Customer Support Email}}.
- **Language Generation Tags:** Contains 12 types of tags for morphological, semantic, syntactic, and register variations, aiding in the creation of dialogues that mimic real-life conversational patterns.

For an immersive look into the dataset structure and content, refer to the embedded notebook which showcases sample queries and responses along with detailed instructions for use.

**Additional Insights**

- The dataset has been meticulously curated by computational linguists, ensuring quality and relevance.
- Comprehensive tagging allows for dataset customization based on linguistic phenomena, catering to various user profiles and conversational styles.
- Visit [Bitext's Vertical-Specific Datasets](https://www.bitext.com/chatbot-verticals/) for an in-depth understanding of our vertical coverage and intents.
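The entity/slot placeholders mentioned above, such as {{Order Number}} and {{Customer Support Email}}, are the natural hook for the second fine-tuning step, where company-specific values replace generic slots. The following is a minimal sketch of that substitution, not the provider's tooling: the sample answer text, the `fill_slots` and `find_slots` helpers, and the `support@acme.example` address are all hypothetical, and the only detail taken from the description is the `{{Slot Name}}` notation.

```python
import re

# Matches placeholders written as {{Slot Name}}, the notation shown in the
# dataset description. Everything else in this sketch is an assumption.
SLOT_PATTERN = re.compile(r"\{\{\s*([^{}]+?)\s*\}\}")


def fill_slots(text: str, values: dict[str, str]) -> str:
    """Replace each {{Slot Name}} with its value; leave unknown slots as-is."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Fall back to the original placeholder when no value is supplied.
        return values.get(name, match.group(0))

    return SLOT_PATTERN.sub(substitute, text)


def find_slots(text: str) -> list[str]:
    """Return the distinct slot names still present in the text, sorted."""
    return sorted(set(SLOT_PATTERN.findall(text)))


# Hypothetical answer text; it is not copied from the dataset.
sample_answer = (
    "To track order {{Order Number}}, email us at {{Customer Support Email}}."
)
# Hypothetical company-specific value for one of the slots.
company_values = {"Customer Support Email": "support@acme.example"}

filled = fill_slots(sample_answer, company_values)
print(filled)
print(find_slots(filled))
```

Leaving unknown slots untouched, rather than raising an error, keeps per-conversation values such as the order number available for filling at inference time.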


Provider:

Bitext Innovation International

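Background overview

This is a hybrid synthetic collection of customer service question-answer pairs, intended for fine-tuning large language models such as GPT and Mistral to improve their domain adaptation to the customer support sector. It contains 27 intents, nearly 27,000 question-answer pairs, 30 entity types, and 12 language generation tags, supporting fine-grained training for intent detection and response generation.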

The content above was collected and summarized by 遇见数据集.
