five

Retail Banking QA Pairs for LLM Conversational Fine-Tuning

收藏
Databricks2024-07-06 收录
下载链接:
https://marketplace.databricks.com/details/62fe9896-c40d-4a68-a0be-517f50ca36a7/Bitext-Innovation-International_Retail-Banking-QA-Pairs-for-LLM-Conversational-Fine-Tuning
下载链接
链接失效反馈
官方服务:
资源简介:
**Overview** This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Retail Banking sector can be easily achieved using our two-step approach to LLM Fine-Tuning. For example, if you are [ACME Bank], you can create your own customized LLM by first training a fine-tuned model using this dataset, and then further fine-tuning it with a small amount of your own data. An overview of this approach can be found at: [From General-Purpose LLMs to Verticalized Enterprise Models](https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/) **Use cases** - Training sophisticated AI models for intent detection and response generation in Retail Banking customer service applications. - Enhancing domain adaptation and fine-tuning of Large Language Models such as DBRX, GPT, Mistral, Llama3, Falcon, etc. with a diverse range of intents and contextual customer queries. - This represents step one of of our two-step approach to verticalizing GenAI for enterprise use, as we describe in https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/ **Product details** This dataset is designed to train Large Language Models (LLMs) like GPT, Llama3, and Mistral, for Fine Tuning and Domain Adaptation in Retail Banking. - **Use Case**: Customer Service - **Vertical**: Retail Banking - **Intents**: 26 intents assigned to 9 categories - **Pairs**: 25,545 question/answer pairs - **Entities/Slots**: 1,224 entity/slot types - **Language Tags**: 12 types of language generation tags - **Token Count**: 4.98 million tokens in 'instruction' and 'response' **Categories and Intents** Covers a wide range of banking-related intents: - **ACCOUNT**: check_recent_transactions, close_account, create_account - **ATM**: dispute_ATM_withdrawal, recover_swallowed_card - **CARD**: activate_card, block_card, cancel_card, check_card_annual_fee - **CONTACT**: customer_service, human_agent - **FEES**: check_fees - **FIND**: find_ATM, find_branch - **LOAN**: apply_for_loan, apply_for_mortgage, cancel_loan - **PASSWORD**: get_password, set_up_password - **TRANSFER**: cancel_transfer, make_transfer **Fields of the Dataset** Each entry in the dataset comprises the following fields: - **flags**: Labels indicating language variations and styles. - **instruction**: The user's request or question from the retail banking domain. - **category**: The high-level semantic category for the intent. - **intent**: The specific purpose or action the user wants to achieve with the instruction. - **response**: The expected reply or action from the virtual assistant. **Language Generation Tags** Reflects various language variations and styles crucial for adaptable and responsive conversational AI models: - **M**: Morphological variation - **L**: Semantic variations - **B**: Basic syntactic structure - **I**: Interrogative structure - **C**: Coordinated syntactic structure - **N**: Negation - **P**: Politeness variation - **Q**: Colloquial variation - **W**: Offensive language - **K**: Keyword mode - **E**: Use of abbreviations - **Z**: Errors and Typos **License** Released under the **Community Data License Agreement (CDLA) Sharing 1.0**. This license ensures the freedom to use, share, modify, and utilize the data while maintaining attribution to the original data providers and sharing derivatives under the same license. For a detailed understanding of the license, refer to the [CDLA-Sharing 1.0 documentation](https://cdla.dev/sharing-1-0/).
提供机构:
Bitext Innovation International
搜集汇总
数据集介绍
main_image_url
背景与挑战
背景概述
该数据集包含25,545个零售银行领域的问答对,覆盖26种意图和9个类别,旨在用于GPT、Mistral等大型语言模型的微调,提升其在客户服务场景中的意图识别和响应生成能力。数据集还包含1,224种实体/槽位类型和12种语言生成标签,总token数达498万。
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作