five

rescommons/Full-Ecom-Chatbot-Dataset

收藏
Hugging Face2026-03-27 更新2026-03-29 收录
下载链接:
https://hf-mirror.com/datasets/rescommons/Full-Ecom-Chatbot-Dataset
下载链接
链接失效反馈
官方服务:
资源简介:
--- annotations_creators: - machine-generated - expert-generated language_creators: - machine-generated - found language: - en license: mit multilinguality: - monolingual size_categories: - 10K<n<100K source_datasets: - original task_categories: - question-answering - text-generation - other task_ids: - dialogue-modeling - open-domain-qa pretty_name: E-commerce Chatbot Training Data tags: - ecommerce - chatbot - tool-use - customer-support - retail - conversational-ai --- # E-commerce Chatbot Training Data A curated, multi-source dataset for training and evaluating e-commerce conversational AI systems. It covers a broad range of customer intents — from product discovery and order management to returns, tool-augmented responses, and RAG-grounded Q&A — across 16+ product domains. ## Dataset Summary | Split | Records | |-------|---------| | Train | 35,213 | | Test | 8,818 | | **Total** | **44,031** | The train/test split uses **prompt-group-level stratified sampling** on `source × response_type × intent × difficulty` to guarantee identical distributions across both splits with zero prompt contamination between train and test. --- ## Sources | Source | Records | Response Types | Domains | Intents | |--------|---------|----------------|---------|---------| | `synthetic_api_generated` | 3,933 | text, tool_call, mixed | 12 | 19 | | `asos_ecom_dataset` | 2,000 | text | fashion | similarity_search | | `bitext_customer_support` | 5,000 | tool_call, mixed | general | 6 | | `bitext_retail_ecom` | 4,998 | text, tool_call | general | multiple | | `amazon_reviews_2023_*` | 23,100 | text | 16 | 4 | | `amazon_meta_2023_*` | 5,000 | text | 9 | 4 | --- ## Schema | Field | Type | Description | |-------|------|-------------| | `id` | string | Unique record ID (e.g. `ecomm_a1b2c3`) | | `source` | string | Origin dataset/pipeline | | `group` | string | Response group: `A` (tool_call), `B` (text), `C` (mixed) | | `difficulty` | int | Task difficulty: `1` (easy) to `3` (hard) | | `system` | string | System prompt given to the assistant | | `history` | string (JSON) | Prior conversation turns `[{"role": ..., "content": ...}]` | | `prompt` | string | Current user message | | `context` | string (JSON) | Retrieved docs, user profile, cart/order state | | `tools` | string (JSON) | Available tool/function definitions | | `response_type` | string | `text`, `tool_call`, or `mixed` | | `response` | string | Ground-truth assistant response | | `language` | string | ISO language code (e.g. `en`) | | `locale` | string | Locale (e.g. `en-US`) | | `annotator` | string | Annotation source (e.g. `gemini_synthetic`, `bitext`, `amazon_user`) | | `quality_score` | float | Annotation quality score (0–1) | | `domain` | string | Product domain (e.g. `electronics`, `fashion`, `grocery_food`) | | `intent_category` | string | High-level intent category (e.g. `product_discovery`, `order_management`) | | `intent` | string | Fine-grained intent (19 values, e.g. `order_status`, `return_refund`) | | `sub_intent` | string | Further sub-intent (e.g. `track_delivery`, `refund_timeline`) | | `capability` | string | Model capability tag (where applicable) | | `test_tier` | string | Evaluation tier tag (where applicable) | --- ## Intents The dataset covers 19 intents across 7 high-level categories: | Category | Intents | |----------|---------| | Product Discovery | `product_search`, `product_detail_qa`, `product_comparison`, `similarity_search`, `bundle_suggestions`, `gift_recommendation`, `personalized_recommendations` | | Order Management | `order_status`, `order_cancellation`, `reorder_assistance` | | Returns & Exchanges | `return_refund`, `exchange_request` | | Cart & Checkout | `cart_management`, `payment_issues` | | Customer Support | `complaint_handling`, `human_handoff`, `faq_answering` | | Account | `account_management` | | Inventory | `stock_availability` | --- ## Product Domains `appliances`, `beauty`, `books_media`, `electronics`, `fashion`, `gaming`, `garden_outdoor`, `grocery_food`, `home_kitchen`, `industrial`, `pet_supplies`, `sports_outdoors`, `automotive`, `baby`, `health`, `office`, `toys_games` --- ## Usage ```python from datasets import load_dataset ds = load_dataset("V1rtucious/ecom-chatbot-train-data") train = ds["train"] test = ds["test"] # Filter by response type tool_call_examples = train.filter(lambda x: x["response_type"] == "tool_call") # Filter by intent order_queries = train.filter(lambda x: x["intent"] == "order_status") ``` --- ## Split Methodology Both splits were produced using **prompt-group-level stratified sampling** to ensure zero contamination, maximum variance, and minimum bias: - **Stratification key:** `source | response_type | intent | difficulty` - **Splitting unit:** unique `(source, prompt)` groups — all records sharing a prompt are assigned atomically to one split - **40,949 prompt groups** across 44,031 records; 3,082 records share a prompt with at least one other record - **Fallback cascade** for rare strata (< 5 groups): drops `difficulty`, then drops to `source` only - **113 unique strata** | **Random seed:** 42 (reproducible) - **Prompt contamination between splits: 0** (verified post-split) Distribution drift between train and test is < 0.35% across all key columns. --- ## License This dataset is released under the **MIT License**. Individual source data may carry additional terms from their original providers (Amazon, ASOS, Bitext).

annotations_creators: - 机器生成(machine-generated) - 专家生成(expert-generated) language_creators: - 机器生成(machine-generated) - 公开获取(found) language: - 英语(en) license: MIT许可证(MIT) multilinguality: - 单语言(monolingual) size_categories: - 10K<n<100K source_datasets: - 原始数据集(original) task_categories: - 问答(question-answering) - 文本生成(text-generation) - 其他(other) task_ids: - 对话建模(dialogue-modeling) - 开放域问答(open-domain-qa) pretty_name: 电子商务聊天机器人训练数据(E-commerce Chatbot Training Data) tags: - 电子商务(ecommerce) - 聊天机器人(chatbot) - 工具使用(tool-use) - 客户支持(customer-support) - 零售(retail) - 对话式AI(conversational-ai) # 电子商务聊天机器人训练数据集(E-commerce Chatbot Training Data) 本数据集为经精心整理的多源数据集,用于训练与评估电子商务对话式AI(conversational AI)系统。其覆盖16余个产品领域,涵盖丰富的用户意图,从商品发现、订单管理,到退换货、工具增强型回复,以及基于检索增强生成(Retrieval-Augmented Generation,RAG)的问答任务。 ## 数据集概览 | 拆分布局 | 样本数量 | |-------|---------| | 训练集 | 35,213 | | 测试集 | 8,818 | | **总计** | **44,031** | 本次训练集与测试集的划分采用**基于提示词组的分层抽样(prompt-group-level stratified sampling)**,抽样维度为`数据源 × 回复类型 × 意图 × 难度`,以确保两个拆分集合的分布完全一致,且训练集与测试集之间无任何提示词污染(prompt contamination)。 --- ## 数据来源 | 数据源 | 样本数量 | 回复类型 | 覆盖领域 | 意图数量 | |--------|---------|----------------|---------|---------| | `synthetic_api_generated` | 3,933 | 文本、工具调用、混合 | 12个 | 19种 | | `asos_ecom_dataset` | 2,000 | 仅文本 | 时尚领域 | 相似度搜索 | | `bitext_customer_support` | 5,000 | 工具调用、混合 | 通用领域 | 6种 | | `bitext_retail_ecom` | 4,998 | 文本、工具调用 | 通用领域 | 多种 | | `amazon_reviews_2023_*` | 23,100 | 仅文本 | 16个领域 | 4种 | | `amazon_meta_2023_*` | 5,000 | 仅文本 | 9个领域 | 4种 | --- ## 数据 Schema | 字段名 | 数据类型 | 字段说明 | |-------|------|-------------| | `id` | 字符串(string) | 唯一记录ID(示例:`ecomm_a1b2c3`) | | `source` | 字符串 | 原始数据集/处理流水线来源 | | `group` | 字符串 | 回复分组:`A`(工具调用)、`B`(文本回复)、`C`(混合回复) | | `difficulty` | 整数(int) | 任务难度:`1`(简单)至`3`(困难) | | `system` | 字符串 | 分配给助手的系统提示词 | | `history` | JSON格式字符串 | 历史对话轮次,格式为`[{"role": ..., "content": ...}]` | | `prompt` | 字符串 | 当前用户输入的消息 | | `context` | JSON格式字符串 | 检索到的文档、用户个人资料、购物车/订单状态 | | `tools` | JSON格式字符串 | 可用工具/函数定义 | | `response_type` | 字符串 | 回复类型:`text`(文本)、`tool_call`(工具调用)或`mixed`(混合) | | `response` | 字符串 | 助手的真实标注回复 | | `language` | 字符串 | ISO语言代码(示例:`en`) | | `locale` | 字符串 | 区域设置(示例:`en-US`) | | `annotator` | 字符串 | 注释来源(示例:`gemini_synthetic`、`bitext`、`amazon_user`) | | `quality_score` | 浮点数(float) | 注释质量评分(范围0至1) | | `domain` | 字符串 | 产品领域(示例:`electronics`(电子产品)、`fashion`(时尚)、`grocery_food`(食品杂货)) | | `intent_category` | 字符串 | 高级意图类别(示例:`product_discovery`(商品发现)、`order_management`(订单管理)) | | `intent` | 字符串 | 细粒度意图(共19种,示例:`order_status`(订单状态查询)、`return_refund`(退款退货)) | | `sub_intent` | 字符串 | 进一步细分的子意图(示例:`track_delivery`(物流追踪)、`refund_timeline`(退款进度查询)) | | `capability` | 字符串 | 模型能力标签(适用于相关场景) | | `test_tier` | 字符串 | 评估层级标签(适用于相关场景) | --- ## 意图分类 本数据集涵盖7个高级类别下的19种细粒度意图: | 高级意图类别 | 细粒度意图 | |----------|---------| | 商品发现(Product Discovery) | `product_search`(商品搜索)、`product_detail_qa`(商品详情问答)、`product_comparison`(商品对比)、`similarity_search`(相似度搜索)、`bundle_suggestions`(套餐推荐)、`gift_recommendation`(礼品推荐)、`personalized_recommendations`(个性化推荐) | | 订单管理(Order Management) | `order_status`(订单状态查询)、`order_cancellation`(订单取消)、`reorder_assistance`(重新下单协助) | | 退换货与换货(Returns & Exchanges) | `return_refund`(退款退货)、`exchange_request`(换货申请) | | 购物车与结账(Cart & Checkout) | `cart_management`(购物车管理)、`payment_issues`(支付问题) | | 客户支持(Customer Support) | `complaint_handling`(投诉处理)、`human_handoff`(转人工客服)、`faq_answering`(常见问题解答) | | 账户管理(Account) | `account_management`(账户管理) | | 库存查询(Inventory) | `stock_availability`(库存可用性查询) | --- ## 产品覆盖领域 `appliances`(家电)、`beauty`(美妆)、`books_media`(图书音像)、`electronics`(电子产品)、`fashion`(时尚服饰)、`gaming`(游戏)、`garden_outdoor`(园艺户外)、`grocery_food`(食品杂货)、`home_kitchen`(家居厨房)、`industrial`(工业用品)、`pet_supplies`(宠物用品)、`sports_outdoors`(运动户外)、`automotive`(汽车用品)、`baby`(母婴)、`health`(健康保健)、`office`(办公用品)、`toys_games`(玩具游戏) --- ## 使用方法 python from datasets import load_dataset ds = load_dataset("V1rtucious/ecom-chatbot-train-data") train = ds["train"] test = ds["test"] # 按回复类型筛选 tool_call_examples = train.filter(lambda x: x["response_type"] == "tool_call") # 按意图筛选 order_queries = train.filter(lambda x: x["intent"] == "order_status") --- ## 拆分方法说明 本次拆分采用**基于提示词组的分层抽样**方法,以确保无提示词污染、样本方差最大化与偏差最小化: - **分层维度**:`数据源 | 回复类型 | 意图 × 难度` - **拆分单元**:唯一的`(数据源, 提示词)`组 — 所有共享同一提示词的记录会被整体分配至同一个拆分集合 - 44,031条样本共包含40,949个提示词组;其中3,082条样本与至少一条其他样本共享同一提示词 - **稀有分层降级策略**:对于样本量小于5个组的稀有分层,依次移除`难度`维度,最终仅保留`数据源`维度 - 共113个唯一分层 | **随机种子**:42(可复现) - 拆分集合间的提示词污染率为0(拆分后已验证) 训练集与测试集在所有关键列上的分布漂移率均小于0.35%。 --- ## 许可证 本数据集采用**MIT许可证**进行发布。各原始数据源可能附带其原始提供方(Amazon(亚马逊)、ASOS、Bitext)的额外条款。
提供机构:
rescommons
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作