Retail Banking QA Pairs for LLM Conversational Fine-Tuning
收藏Databricks2024-07-06 收录
下载链接:
https://marketplace.databricks.com/details/62fe9896-c40d-4a68-a0be-517f50ca36a7/Bitext-Innovation-International_Retail-Banking-QA-Pairs-for-LLM-Conversational-Fine-Tuning
下载链接
链接失效反馈官方服务:
资源简介:
**Overview**
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Retail Banking sector can be easily achieved using our two-step approach to LLM Fine-Tuning. For example, if you are [ACME Bank], you can create your own customized LLM by first training a fine-tuned model using this dataset, and then further fine-tuning it with a small amount of your own data. An overview of this approach can be found at: [From General-Purpose LLMs to Verticalized Enterprise Models](https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/)
**Use cases**
- Training sophisticated AI models for intent detection and response generation in Retail Banking customer service applications.
- Enhancing domain adaptation and fine-tuning of Large Language Models such as DBRX, GPT, Mistral, Llama3, Falcon, etc. with a diverse range of intents and contextual customer queries.
- This represents step one of of our two-step approach to verticalizing GenAI for enterprise use, as we describe in https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/
**Product details**
This dataset is designed to train Large Language Models (LLMs) like GPT, Llama3, and Mistral, for Fine Tuning and Domain Adaptation in Retail Banking.
- **Use Case**: Customer Service
- **Vertical**: Retail Banking
- **Intents**: 26 intents assigned to 9 categories
- **Pairs**: 25,545 question/answer pairs
- **Entities/Slots**: 1,224 entity/slot types
- **Language Tags**: 12 types of language generation tags
- **Token Count**: 4.98 million tokens in 'instruction' and 'response'
**Categories and Intents**
Covers a wide range of banking-related intents:
- **ACCOUNT**: check_recent_transactions, close_account, create_account
- **ATM**: dispute_ATM_withdrawal, recover_swallowed_card
- **CARD**: activate_card, block_card, cancel_card, check_card_annual_fee
- **CONTACT**: customer_service, human_agent
- **FEES**: check_fees
- **FIND**: find_ATM, find_branch
- **LOAN**: apply_for_loan, apply_for_mortgage, cancel_loan
- **PASSWORD**: get_password, set_up_password
- **TRANSFER**: cancel_transfer, make_transfer
**Fields of the Dataset**
Each entry in the dataset comprises the following fields:
- **flags**: Labels indicating language variations and styles.
- **instruction**: The user's request or question from the retail banking domain.
- **category**: The high-level semantic category for the intent.
- **intent**: The specific purpose or action the user wants to achieve with the instruction.
- **response**: The expected reply or action from the virtual assistant.
**Language Generation Tags**
Reflects various language variations and styles crucial for adaptable and responsive conversational AI models:
- **M**: Morphological variation
- **L**: Semantic variations
- **B**: Basic syntactic structure
- **I**: Interrogative structure
- **C**: Coordinated syntactic structure
- **N**: Negation
- **P**: Politeness variation
- **Q**: Colloquial variation
- **W**: Offensive language
- **K**: Keyword mode
- **E**: Use of abbreviations
- **Z**: Errors and Typos
**License**
Released under the **Community Data License Agreement (CDLA) Sharing 1.0**. This license ensures the freedom to use, share, modify, and utilize the data while maintaining attribution to the original data providers and sharing derivatives under the same license.
For a detailed understanding of the license, refer to the [CDLA-Sharing 1.0 documentation](https://cdla.dev/sharing-1-0/).
提供机构:
Bitext Innovation International
搜集汇总
数据集介绍

背景与挑战
背景概述
该数据集包含25,545个零售银行领域的问答对,覆盖26种意图和9个类别,旨在用于GPT、Mistral等大型语言模型的微调,提升其在客户服务场景中的意图识别和响应生成能力。数据集还包含1,224种实体/槽位类型和12种语言生成标签,总token数达498万。
以上内容由遇见数据集搜集并总结生成



