Travel QA Pairs for LLM Conversational Fine-Tuning

Name: Travel QA Pairs for LLM Conversational Fine-Tuning
Creator: Bitext Innovation International
License: 暂无描述

Databricks2024-09-29 收录

下载链接：

https://marketplace.databricks.com/details/b5752e30-9089-4a86-a95f-97726b73d9d7/Bitext-Innovation-International_Travel-QA-Pairs-for-LLM-Conversational-Fine-Tuning

下载链接

链接失效反馈

官方服务：

资源简介：

**Overview** This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of this approach can be found at: [From General-Purpose LLMs to Verticalized Enterprise Models](https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/) The dataset has the following specifications: - Use Case: Intent Detection - Vertical: Travel - 33 intents assigned to 11 categories - 31658 question/answer pairs, with approximately 1000 per intent - 72 entity/slot types - 10 different types of language generation tags The categories and intents are derived from Bitext's extensive experience across various industry-specific datasets, ensuring the relevance and applicability across diverse contexts. **Dataset Token Count** The dataset contains a total of 4.16 million tokens across 'instruction' and 'response' columns. This extensive corpus is crucial for training sophisticated LLMs that can perform a variety of functions including conversational AI, question answering, and virtual assistant tasks in the travel domain. **Fields of the Dataset** Each entry in the dataset comprises the following fields: - tags - instruction: a user request from the Travel domain - category: the high-level semantic category for the intent - intent: the specific intent corresponding to the user instruction - response: an example of an expected response from the virtual assistant **Categories and Intents** The dataset covers a wide range of travel-related categories and intents, which are: - **BAGGAGE**: check_baggage_allowance - **BOARDING_PASS**: get_boarding_pass, print_boarding_pass - **CANCELLATION_FEE**: check_cancellation_fee - **CHECK_IN**: check_in - **CONTACT**: human_agent - **FLIGHT**: book_flight, cancel_flight, change_flight, check_flight_insurance_coverage, check_flight_offers, check_flight_prices, check_flight_reservation, check_flight_status, purchase_flight_insurance, search_flight, search_flight_insurance - **PRICES**: check_trip_prices - **REFUND**: get_refund - **SEAT**: change_seat, choose_seat - **TIME**: check_arrival_time, check_departure_time - **TRIP**: book_trip, cancel_trip, change_trip, check_trip_details, check_trip_insurance_coverage, check_trip_offers, check_trip_plan, check_trip_prices, purchase_trip_insurance, search_trip, search_trip_insurance **Entities** The entities covered by the dataset include: - **{{WEBSITE_URL}}**, common with most intents. - **{{APP_NAME}}**, featured in intents like change_flight, check_arrival_time. - **{{CUSTOMER_SUPPORT}}**, associated with intents like check_flight_reservation, check_trip_insurance_coverage, check_trip_plan. - **{{ORIGIN_CITY}}**, relevant to intents such as book_flight, change_flight. - **{{DESTINATION_CITY}}**, featured in intents like book_flight, book_trip. This comprehensive list of entities ensures that the dataset is well-equipped to train models that are highly adept at understanding and processing a wide range of travel-related queries and tasks. **Language Generation Tags** The dataset includes tags indicative of various language variations and styles adapted for Travel, enhancing the robustness and versatility of models trained on this data. These tags categorize the utterances into different registers such as colloquial, formal, or containing specific travel jargon, ensuring that the trained models can understand and generate a range of conversational styles appropriate for different customer interactions in the travel sector. **Language Generation Tags** The dataset includes tags that reflect various language variations and styles, crucial for creating adaptable and responsive conversational AI models within the travel sector. These tags help in understanding and generating appropriate responses based on the linguistic context and user interaction style. **Tags for Lexical variation** - **M - Morphological variation**: Adjusts for inflectional and derivational forms. - Example: "is my account active", "is my account activated" - **L - Semantic variations**: Handles synonyms, use of hyphens, and compounding. - Example: “what's my balance date", “what's my billing date” **Tags for Syntactic structure variation** - **B - Basic syntactic structure**: Simple, direct commands or statements. - Example: "activate my card", "I need to check my balance" - **I - Interrogative structure**: Structuring sentences in the form of questions. - Example: “can you show my balance?”, “how do I transfer money?” - **C - Coordinated syntactic structure**: Complex sentences coordinating multiple ideas or tasks. - Example: “I want to transfer money and check my balance, what should I do?” - **N - Negation**: Expressing denial or contradiction. - Example: "I do not wish to proceed with this transaction, how can I stop it?" **Tags for language register variations** - **P - Politeness variation**: Polite forms often used in customer service. - Example: “could you please help me check my account balance?” - **Q - Colloquial variation**: Informal language that might be used in casual customer interactions. - Example: "can u tell me my balance?" - **W - Offensive language**: Handling potentially offensive language which might occasionally appear in frustrated customer interactions. - Example: “I’m upset with these charges, this is ridiculous!” **Tags for stylistic variations** - **K - Keyword mode**: Responses focused on keywords. - Example: "balance check", "account status" - **E - Use of abbreviations**: Common abbreviations. - Example: “acct for account”, “trans for transaction” - **Z - Errors and Typos**: Includes common misspellings or typographical errors found in customer inputs. - Example: “how can I chek my balance” **Other tags not in use in this Dataset** - **D - Indirect speech**: Expressing commands or requests indirectly. - Example: “I was wondering if you could show me my last transaction.” - **G - Regional variations**: Adjustments for regional language differences. - Example: American vs British English: "checking account" vs "current account" - **R - Respect structures - Language-dependent variations**: Formality levels appropriate in different languages. - Example: Using “vous” in French for formal addressing instead of “tu.” - **Y - Code switching**: Switching between languages or dialects within the same conversation. - Example: “Can you help me with my cuenta, please?” These tags not only aid in training models for a wide range of customer interactions but also ensure that the models are culturally and linguistically sensitive, enhancing the customer experience in travel environments. **License** The `Bitext-travel-llm-chatbot-training-dataset` is released under the **Community Data License Agreement (CDLA) Sharing 1.0**. This license facilitates broad sharing and collaboration while ensuring that the freedom to use, share, modify, and utilize the data remains intact for all users. **Key Aspects of CDLA-Sharing 1.0** - **Attribution and ShareAlike**: Users must attribute the dataset and continue to share derivatives under the same license. - **Non-Exclusivity**: The license is non-exclusive, allowing multiple users to utilize the data simultaneously. - **Irrevocability**: Except in cases of material non-compliance, rights under this license are irrevocable. - **No Warranty**: The dataset is provided without warranties regarding its accuracy, completeness, or fitness for a particular purpose. - **Limitation of Liability**: Both users and data providers limit their liability for damages arising from the use of the dataset. **Usage Under CDLA-Sharing 1.0** By using the `Bitext-travel-llm-chatbot-training-dataset`, you agree to adhere to the terms set forth in the CDLA-Sharing 1.0. It is essential to ensure that any publications or distributions of the data, or derivatives thereof, maintain attribution to the original data providers and are distributed under the same or compatible terms of this agreement. For a detailed understanding of the license, refer to the [official CDLA-Sharing 1.0 documentation](https://cdla.dev/sharing-1-0/). This license supports the open sharing and collaborative improvement of datasets within the AI and data science community, making it particularly suited for projects aimed at developing and enhancing AI technologies in the travel sector. --- (c) Bitext Innovations, 2024

**概述** 本混合合成数据集旨在用于微调大语言模型（Large Language Model，LLM），例如GPT、Mistral与OpenELM，其生成依托我方自然语言处理（Natural Language Processing，NLP）、自然语言生成（Natural Language Generation，NLG）技术以及自动化数据标注（Data Labeling，DAL）工具。本数据集旨在演示如何通过我方的大语言模型微调两步法，轻松实现[旅游]领域的垂直化/领域适配。该方法的概述可参阅：[从通用大语言模型到垂直化企业模型](https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/) 本数据集具备以下规格： - 用例：意图识别 - 领域：旅游 - 11个分类共包含33个意图 - 31658条问答对，单意图平均约1000条 - 72种实体/槽位类型 - 10种语言生成标签类型本数据集的分类与意图均基于Bitext在多行业专属数据集领域的丰富经验构建，确保其在多样场景下的相关性与适用性。 **数据集Token数量** 本数据集的"instruction"（指令）与"response"（回复）字段总计包含416万Token。该大规模语料库对于训练能够在旅游领域执行对话式AI、问答以及虚拟助手等多种任务的高性能大语言模型至关重要。 **数据集字段** 数据集中的每条记录均包含以下字段： - tags：标签 - instruction：旅游领域的用户请求 - category：对应意图的高层语义分类 - intent：与用户指令对应的具体意图 - response：虚拟助手的预期回复示例 **分类与意图** 本数据集涵盖多类旅游相关分类与意图，具体如下： - **BAGGAGE**：check_baggage_allowance - **BOARDING_PASS**：get_boarding_pass、print_boarding_pass - **CANCELLATION_FEE**：check_cancellation_fee - **CHECK_IN**：check_in - **CONTACT**：human_agent - **FLIGHT**：book_flight、cancel_flight、change_flight、check_flight_insurance_coverage、check_flight_offers、check_flight_prices、check_flight_reservation、check_flight_status、purchase_flight_insurance、search_flight、search_flight_insurance - **PRICES**：check_trip_prices - **REFUND**：get_refund - **SEAT**：change_seat、choose_seat - **TIME**：check_arrival_time、check_departure_time - **TRIP**：book_trip、cancel_trip、change_trip、check_trip_details、check_trip_insurance_coverage、check_trip_offers、check_trip_plan、check_trip_prices、purchase_trip_insurance、search_trip、search_trip_insurance **实体** 本数据集涵盖的实体如下： - **{{WEBSITE_URL}}**：适用于多数意图 - **{{APP_NAME}}**：出现在change_flight、check_arrival_time等意图中 - **{{CUSTOMER_SUPPORT}}**：关联于check_flight_reservation、check_trip_insurance_coverage、check_trip_plan等意图 - **{{ORIGIN_CITY}}**：适用于book_flight、change_flight等意图 - **{{DESTINATION_CITY}}**：出现在book_flight、book_trip等意图中该全面的实体列表确保本数据集能够充分训练能够熟练理解与处理各类旅游相关查询与任务的模型。 **语言生成标签** 本数据集包含能够体现各类语言变体与风格的标签，适配旅游领域场景，可提升基于该数据训练的模型的鲁棒性与通用性。这些标签将话语划分为不同语体，例如口语化、正式语体或包含特定旅游行话，确保训练后的模型能够理解并生成适用于旅游领域不同客户交互场景的多样化对话风格。 **语言生成标签** 本数据集包含能够反映各类语言变体与风格的标签，这对于构建旅游领域内具备适应性与响应能力的对话式AI模型至关重要。这些标签有助于基于语言语境与用户交互风格理解并生成恰当的回复。 **词汇变体标签** - **M - 形态变体**：适配屈折与派生形式。示例："is my account active"、"is my account activated" - **L - 语义变体**：处理同义词、连字符使用与复合构词。示例："what's my balance date"、"what's my billing date" **句法结构变体标签** - **B - 基础句法结构**：简单直接的命令或陈述。示例："activate my card"、"I need to check my balance" - **I - 疑问结构**：采用问句形式组织句子。示例："can you show my balance?"、"how do I transfer money?" - **C - 并列句法结构**：包含多个想法或任务的复合句。示例："I want to transfer money and check my balance, what should I do?" - **N - 否定**：表达否认或矛盾。示例："I do not wish to proceed with this transaction, how can I stop it?" **语体变体标签** - **P - 礼貌变体**：客服场景中常用的礼貌表达。示例："could you please help me check my account balance?" - **Q - 口语变体**：非正式语言，可用于休闲客户交互场景。示例："can u tell me my balance?" - **W - 冒犯性语言**：处理客户沮丧交互中可能出现的潜在冒犯性语言。示例："I’m upset with these charges, this is ridiculous!" **风格变体标签** - **K - 关键词模式**：以关键词为核心的回复。示例："balance check"、"account status" - **E - 缩写使用**：常见缩写形式。示例："acct for account"、"trans for transaction" - **Z - 错误与拼写失误**：包含客户输入中常见的拼写错误或笔误。示例："how can I chek my balance" **本数据集未使用的其他标签** - **D - 间接引语**：间接表达命令或请求。示例："I was wondering if you could show me my last transaction." - **G - 地域变体**：适配区域语言差异。示例：美式英语与英式英语对比："checking account" vs "current account" - **R - 尊重结构 - 语言相关变体**：不同语言下适配的正式程度。示例：法语中使用"vous"进行正式称呼，而非"tu" - **Y - 代码切换**：同一场对话中切换语言或方言。示例："Can you help me with my cuenta, please?" 这些标签不仅有助于训练模型应对广泛的客户交互场景，还能确保模型具备文化与语言敏感性，提升旅游环境中的客户体验。 **许可协议** 本`Bitext-travel-llm-chatbot-training-dataset`数据集采用**社区数据许可协议（Community Data License Agreement，CDLA）Sharing 1.0**进行发布。该许可协议支持广泛的共享与协作，同时确保所有用户使用、分享、修改及利用该数据集的自由不受损害。 **CDLA-Sharing 1.0核心条款** - **署名与相同方式共享**：用户必须对本数据集进行署名，并在相同许可协议下共享衍生作品 - **非排他性**：本许可为非排他性许可，允许多名用户同时使用该数据集 - **不可撤销性**：除重大违规情形外，本许可下的权利不可撤销 - **无担保**：本数据集按现状提供，不就其准确性、完整性或特定用途适用性作出任何担保 - **责任限制**：用户与数据提供方均对因使用本数据集而产生的损害赔偿责任进行限制 **CDLA-Sharing 1.0协议下的使用规范** 使用`Bitext-travel-llm-chatbot-training-dataset`数据集即表示您同意遵守CDLA-Sharing 1.0协议的各项条款。任何对本数据集或其衍生作品的发布、分发，均必须保留对原始数据提供方的署名，并采用本协议或兼容的许可条款进行分发。如需详细了解该许可协议，请参阅[CDLA-Sharing 1.0官方文档](https://cdla.dev/sharing-1-0/)。该许可协议支持AI与数据科学社区内数据集的开放共享与协作改进，尤其适用于旨在开发与增强旅游领域AI技术的项目。 --- © Bitext Innovations, 2024

提供机构：

Bitext Innovation International

搜集汇总

数据集介绍

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集