bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset

Name: bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset
Creator: bitext
Published: 2024-08-05 23:04:53
License: CDLA-Sharing 1.0

Source: Hugging Face. Updated: 2024-08-05. Indexed: 2025-04-12.

Download link:

https://hf-mirror.com/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset

Resource description:

---
license: cdla-sharing-1.0
task_categories:
- question-answering
- table-question-answering
language:
- en
tags:
- question-answering
- llm
- chatbot
- retail
- ecommerce
- conversational-ai
- generative-ai
- natural-language-understanding
- fine-tuning
pretty_name: >-
  Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants
size_categories:
- 10K<n<100K
---

# Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants

## Overview

This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Retail (eCommerce) sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of this approach can be found at: [From General-Purpose LLMs to Verticalized Enterprise Models](https://www.bitext.com/blog/general-purpose-models-verticalized-enterprise-genai/)

The dataset has the following specifications:

- Use Case: Intent Detection
- Vertical: Retail (eCommerce)
- 46 intents assigned to 13 categories
- 44884 question/answer pairs, with approximately 1000 per intent
- 2970 entity/slot types
- 10 different types of language generation tags

The categories and intents are derived from Bitext's extensive experience across various industry-specific datasets, ensuring the relevance and applicability across diverse contexts.

## Dataset Token Count

The dataset contains a total of 8.47 million tokens across 'instruction' and 'response' columns. This extensive corpus is crucial for training sophisticated LLMs that can perform a variety of functions including conversational AI, question answering, and virtual assistant tasks in the retail (eCommerce) domain.
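
As a quick sanity check of the figures above, the averages implied by the stated counts can be computed directly. This is a minimal sketch that uses only the numbers quoted in the card (44884 pairs, 46 intents, 13 categories, 8.47 million tokens). The variable names are illustrative, and the card does not say which tokenizer produced the token count, so the per-pair figure is only a rough average.

```python
# Figures quoted in the dataset card (the token total is rounded there).
NUM_PAIRS = 44_884
NUM_INTENTS = 46
NUM_CATEGORIES = 13
TOTAL_TOKENS = 8_470_000  # "8.47 million" across instruction + response

# Averages implied by those figures.
pairs_per_intent = NUM_PAIRS / NUM_INTENTS
tokens_per_pair = TOTAL_TOKENS / NUM_PAIRS
intents_per_category = NUM_INTENTS / NUM_CATEGORIES

print(f"{pairs_per_intent:.1f} question/answer pairs per intent")  # 975.7
print(f"{tokens_per_pair:.1f} tokens per pair")                    # 188.7
print(f"{intents_per_category:.1f} intents per category")          # 3.5
```

The first figure is consistent with the card's "approximately 1000 per intent"; the actual distribution across intents is not stated and may be uneven.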
## Fields of the Dataset

Each entry in the dataset comprises the following fields:

- tags
- instruction: a user request from the Retail (eCommerce) domain
- category: the high-level semantic category for the intent
- intent: the specific intent corresponding to the user instruction
- response: an example of an expected response from the virtual assistant

## Categories and Intents

The dataset covers a wide range of retail-related categories and intents, which are:

- **CONTACT**: customer_service, human_agent
- **ACCOUNT**: change_account, close_account, open_account, order_history, recover_password
- **APP_WEBSITE**: technical_issue, use_app
- **CART**: add_product, remove_product
- **DELIVERY**: damaged_delivery, delivery_issue, delivery_time, missing_item, shipping_costs, track_delivery, wrong_item
- **FEEDBACK**: submit_feedback, submit_product_feedback, submit_product_idea
- **ORDER**: cancel_order, change_order, request_invoice, track_order
- **PAYMENT**: pay, payment_issue, payment_methods
- **PRODUCT**: availability, availability_in_store, availability_online, exchange_product, exchange_product_in_store, product_information, product_issue
- **RETURNS**: refund_policy, refund_status, request_refund, return_policy, return_product, return_product_in_store, return_product_online
- **SALES**: sales_period
- **STORE**: store_location, store_opening_hours
- **USER**: request_right_to_rectification

## Entities

The entities covered by the dataset include:

- **{{Customer Support Phone Number}}**, **{{Company Website URL}}**, common to all intents.
- **{{Email Address}}**, **{{Name}}**, common to most intents.
- **{{Order History}}**, featured in intents like cancel_order, change_order.
- **{{My Orders}}**, featured in intents like exchange_product, order_history, request_invoice.
- **{{Store Location}}**, relevant to intents such as availability_in_store, exchange_product_in_store.
- **{{Returns}}**, associated with intents like exchange_product, refund_policy, refund_status.
- **{{Account Settings}}**, important for intents including change_account, close_account.
- **{{Return Policy}}**, typically present in intents such as refund_policy.
- **{{Tracking Number}}**, featured in intents like delivery_issue, delivery_time.
- **{{Refunds}}**, relevant to intents such as exchange_product, return_product.
- **{{Edit Order}}**, associated with intents like change_order, remove_product.
- **{{Purchase History}}**, important for intents including order_history, request_refund.

This list of entities equips the dataset to train models that understand and process a wide range of retail-related queries and tasks.

## Language Generation Tags

The dataset includes tags indicative of various language variations and styles adapted for Retail (eCommerce), enhancing the robustness and versatility of models trained on this data. These tags categorize the utterances into different registers such as colloquial, formal, or containing specific retail jargon, ensuring that the trained models can understand and generate a range of conversational styles appropriate for different customer interactions in the retail sector. They also help in understanding and generating appropriate responses based on the linguistic context and user interaction style.

### Tags for Lexical variation

- **M - Morphological variation**: Adjusts for inflectional and derivational forms.
  - Example: "is my account active", "is my account activated"
- **L - Semantic variations**: Handles synonyms, use of hyphens, and compounding.
  - Example: "what's my balance date", "what's my billing date"

### Tags for Syntactic structure variation

- **B - Basic syntactic structure**: Simple, direct commands or statements.
  - Example: "activate my card", "I need to check my balance"
- **I - Interrogative structure**: Structuring sentences in the form of questions.
  - Example: "can you show my balance?", "how do I transfer money?"
- **C - Coordinated syntactic structure**: Complex sentences coordinating multiple ideas or tasks.
  - Example: "I want to transfer money and check my balance, what should I do?"
- **N - Negation**: Expressing denial or contradiction.
  - Example: "I do not wish to proceed with this transaction, how can I stop it?"

### Tags for language register variations

- **P - Politeness variation**: Polite forms often used in customer service.
  - Example: "could you please help me check my account balance?"
- **Q - Colloquial variation**: Informal language that might be used in casual customer interactions.
  - Example: "can u tell me my balance?"
- **W - Offensive language**: Handling potentially offensive language which might occasionally appear in frustrated customer interactions.
  - Example: "I'm upset with these charges, this is ridiculous!"

### Tags for stylistic variations

- **K - Keyword mode**: Responses focused on keywords.
  - Example: "balance check", "account status"
- **E - Use of abbreviations**: Common abbreviations.
  - Example: "acct" for "account", "trans" for "transaction"
- **Z - Errors and Typos**: Includes common misspellings or typographical errors found in customer inputs.
  - Example: "how can I chek my balance"

### Other tags not in use in this Dataset

- **D - Indirect speech**: Expressing commands or requests indirectly.
  - Example: "I was wondering if you could show me my last transaction."
- **G - Regional variations**: Adjustments for regional language differences.
  - Example: American vs British English: "checking account" vs "current account"
- **R - Respect structures - Language-dependent variations**: Formality levels appropriate in different languages.
  - Example: Using "vous" in French for formal addressing instead of "tu."
- **Y - Code switching**: Switching between languages or dialects within the same conversation.
  - Example: "Can you help me with my cuenta, please?"

These tags not only aid in training models for a wide range of customer interactions but also ensure that the models are culturally and linguistically sensitive, enhancing the customer experience in retail environments.

## License

The `Bitext-retail-ecommerce-llm-chatbot-training-dataset` is released under the **Community Data License Agreement (CDLA) Sharing 1.0**. This license facilitates broad sharing and collaboration while ensuring that the freedom to use, share, and modify the data remains intact for all users.

### Key Aspects of CDLA-Sharing 1.0

- **Attribution and ShareAlike**: Users must attribute the dataset and continue to share derivatives under the same license.
- **Non-Exclusivity**: The license is non-exclusive, allowing multiple users to utilize the data simultaneously.
- **Irrevocability**: Except in cases of material non-compliance, rights under this license are irrevocable.
- **No Warranty**: The dataset is provided without warranties regarding its accuracy, completeness, or fitness for a particular purpose.
- **Limitation of Liability**: Both users and data providers limit their liability for damages arising from the use of the dataset.

### Usage Under CDLA-Sharing 1.0

By using the `Bitext-retail-ecommerce-llm-chatbot-training-dataset`, you agree to adhere to the terms set forth in the CDLA-Sharing 1.0. It is essential to ensure that any publications or distributions of the data, or derivatives thereof, maintain attribution to the original data providers and are distributed under the same or compatible terms of this agreement.
For a detailed understanding of the license, refer to the [official CDLA-Sharing 1.0 documentation](https://cdla.dev/sharing-1-0/). This license supports the open sharing and collaborative improvement of datasets within the AI and data science community, making it particularly suited for projects aimed at developing and enhancing AI technologies in the retail sector.

---

(c) Bitext Innovations, 2024
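
The record fields, tag codes, and `{{...}}` entity placeholders described in the card can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the card lists a `tags` field without describing its format, so the sketch assumes it is a string of single-letter codes such as `"BQ"`; it also assumes placeholders appear literally in the `response` text. The names `Record`, `TAG_NAMES`, `decode_tags`, and `fill_placeholders`, as well as the sample record and the substituted values, are hypothetical and not part of the dataset.

```python
import re
from dataclasses import dataclass

# Tag codes the card lists as in use, with short descriptions.
TAG_NAMES = {
    "M": "morphological variation",
    "L": "semantic variation",
    "B": "basic syntactic structure",
    "I": "interrogative structure",
    "C": "coordinated syntactic structure",
    "N": "negation",
    "P": "politeness variation",
    "Q": "colloquial variation",
    "W": "offensive language",
    "K": "keyword mode",
    "E": "abbreviations",
    "Z": "errors and typos",
}

# Matches entity placeholders written as {{Some Name}}.
PLACEHOLDER = re.compile(r"\{\{([^{}]+)\}\}")


@dataclass
class Record:
    """One dataset entry with the five fields named in the card."""
    tags: str          # assumed format: one letter per tag, e.g. "BQ"
    instruction: str
    category: str
    intent: str
    response: str


def decode_tags(tags: str) -> list[str]:
    """Map each single-letter code to its description."""
    return [TAG_NAMES.get(code, f"unknown tag {code!r}") for code in tags]


def fill_placeholders(text: str, values: dict[str, str]) -> str:
    """Replace known {{Name}} placeholders; leave unknown ones untouched."""
    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        return values.get(name, match.group(0))
    return PLACEHOLDER.sub(substitute, text)


# Hypothetical record, invented for illustration only.
example = Record(
    tags="BQ",
    instruction="can u tell me where my order is?",
    category="ORDER",
    intent="track_order",
    response="Check {{Company Website URL}} or call {{Customer Support Phone Number}}.",
)

decoded = decode_tags(example.tags)
filled = fill_placeholders(
    example.response,
    {
        "Company Website URL": "https://shop.example",
        "Customer Support Phone Number": "555-0100",
    },
)
print(decoded)
print(filled)
```

Leaving unknown placeholders in place, rather than raising an error, is a design choice of this sketch: it makes missing values easy to spot in generated responses.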


Provider:

bitext
