rescommons/Full-Ecom-Chatbot-Dataset

Name: rescommons/Full-Ecom-Chatbot-Dataset
Creator: rescommons
Published: 2026-03-27 01:00:16
License: 暂无描述

Hugging Face2026-03-27 更新2026-03-29 收录

下载链接：

https://hf-mirror.com/datasets/rescommons/Full-Ecom-Chatbot-Dataset

下载链接

链接失效反馈

官方服务：

资源简介：

--- annotations_creators: - machine-generated - expert-generated language_creators: - machine-generated - found language: - en license: mit multilinguality: - monolingual size_categories: - 10K<n<100K source_datasets: - original task_categories: - question-answering - text-generation - other task_ids: - dialogue-modeling - open-domain-qa pretty_name: E-commerce Chatbot Training Data tags: - ecommerce - chatbot - tool-use - customer-support - retail - conversational-ai --- # E-commerce Chatbot Training Data A curated, multi-source dataset for training and evaluating e-commerce conversational AI systems. It covers a broad range of customer intents — from product discovery and order management to returns, tool-augmented responses, and RAG-grounded Q&A — across 16+ product domains. ## Dataset Summary | Split | Records | |-------|---------| | Train | 35,213 | | Test | 8,818 | | **Total** | **44,031** | The train/test split uses **prompt-group-level stratified sampling** on `source × response_type × intent × difficulty` to guarantee identical distributions across both splits with zero prompt contamination between train and test. --- ## Sources | Source | Records | Response Types | Domains | Intents | |--------|---------|----------------|---------|---------| | `synthetic_api_generated` | 3,933 | text, tool_call, mixed | 12 | 19 | | `asos_ecom_dataset` | 2,000 | text | fashion | similarity_search | | `bitext_customer_support` | 5,000 | tool_call, mixed | general | 6 | | `bitext_retail_ecom` | 4,998 | text, tool_call | general | multiple | | `amazon_reviews_2023_*` | 23,100 | text | 16 | 4 | | `amazon_meta_2023_*` | 5,000 | text | 9 | 4 | --- ## Schema | Field | Type | Description | |-------|------|-------------| | `id` | string | Unique record ID (e.g. `ecomm_a1b2c3`) | | `source` | string | Origin dataset/pipeline | | `group` | string | Response group: `A` (tool_call), `B` (text), `C` (mixed) | | `difficulty` | int | Task difficulty: `1` (easy) to `3` (hard) | | `system` | string | System prompt given to the assistant | | `history` | string (JSON) | Prior conversation turns `[{"role": ..., "content": ...}]` | | `prompt` | string | Current user message | | `context` | string (JSON) | Retrieved docs, user profile, cart/order state | | `tools` | string (JSON) | Available tool/function definitions | | `response_type` | string | `text`, `tool_call`, or `mixed` | | `response` | string | Ground-truth assistant response | | `language` | string | ISO language code (e.g. `en`) | | `locale` | string | Locale (e.g. `en-US`) | | `annotator` | string | Annotation source (e.g. `gemini_synthetic`, `bitext`, `amazon_user`) | | `quality_score` | float | Annotation quality score (0–1) | | `domain` | string | Product domain (e.g. `electronics`, `fashion`, `grocery_food`) | | `intent_category` | string | High-level intent category (e.g. `product_discovery`, `order_management`) | | `intent` | string | Fine-grained intent (19 values, e.g. `order_status`, `return_refund`) | | `sub_intent` | string | Further sub-intent (e.g. `track_delivery`, `refund_timeline`) | | `capability` | string | Model capability tag (where applicable) | | `test_tier` | string | Evaluation tier tag (where applicable) | --- ## Intents The dataset covers 19 intents across 7 high-level categories: | Category | Intents | |----------|---------| | Product Discovery | `product_search`, `product_detail_qa`, `product_comparison`, `similarity_search`, `bundle_suggestions`, `gift_recommendation`, `personalized_recommendations` | | Order Management | `order_status`, `order_cancellation`, `reorder_assistance` | | Returns & Exchanges | `return_refund`, `exchange_request` | | Cart & Checkout | `cart_management`, `payment_issues` | | Customer Support | `complaint_handling`, `human_handoff`, `faq_answering` | | Account | `account_management` | | Inventory | `stock_availability` | --- ## Product Domains `appliances`, `beauty`, `books_media`, `electronics`, `fashion`, `gaming`, `garden_outdoor`, `grocery_food`, `home_kitchen`, `industrial`, `pet_supplies`, `sports_outdoors`, `automotive`, `baby`, `health`, `office`, `toys_games` --- ## Usage ```python from datasets import load_dataset ds = load_dataset("V1rtucious/ecom-chatbot-train-data") train = ds["train"] test = ds["test"] # Filter by response type tool_call_examples = train.filter(lambda x: x["response_type"] == "tool_call") # Filter by intent order_queries = train.filter(lambda x: x["intent"] == "order_status") ``` --- ## Split Methodology Both splits were produced using **prompt-group-level stratified sampling** to ensure zero contamination, maximum variance, and minimum bias: - **Stratification key:** `source | response_type | intent | difficulty` - **Splitting unit:** unique `(source, prompt)` groups — all records sharing a prompt are assigned atomically to one split - **40,949 prompt groups** across 44,031 records; 3,082 records share a prompt with at least one other record - **Fallback cascade** for rare strata (< 5 groups): drops `difficulty`, then drops to `source` only - **113 unique strata** | **Random seed:** 42 (reproducible) - **Prompt contamination between splits: 0** (verified post-split) Distribution drift between train and test is < 0.35% across all key columns. --- ## License This dataset is released under the **MIT License**. Individual source data may carry additional terms from their original providers (Amazon, ASOS, Bitext).

annotations_creators: - 机器生成（machine-generated） - 专家生成（expert-generated） language_creators: - 机器生成（machine-generated） - 公开获取（found） language: - 英语（en） license: MIT许可证（MIT） multilinguality: - 单语言（monolingual） size_categories: - 10K<n<100K source_datasets: - 原始数据集（original） task_categories: - 问答（question-answering） - 文本生成（text-generation） - 其他（other） task_ids: - 对话建模（dialogue-modeling） - 开放域问答（open-domain-qa） pretty_name: 电子商务聊天机器人训练数据（E-commerce Chatbot Training Data） tags: - 电子商务（ecommerce） - 聊天机器人（chatbot） - 工具使用（tool-use） - 客户支持（customer-support） - 零售（retail） - 对话式AI（conversational-ai） # 电子商务聊天机器人训练数据集（E-commerce Chatbot Training Data）本数据集为经精心整理的多源数据集，用于训练与评估电子商务对话式AI（conversational AI）系统。其覆盖16余个产品领域，涵盖丰富的用户意图，从商品发现、订单管理，到退换货、工具增强型回复，以及基于检索增强生成（Retrieval-Augmented Generation，RAG）的问答任务。 ## 数据集概览 | 拆分布局 | 样本数量 | |-------|---------| | 训练集 | 35,213 | | 测试集 | 8,818 | | **总计** | **44,031** | 本次训练集与测试集的划分采用**基于提示词组的分层抽样（prompt-group-level stratified sampling）**，抽样维度为`数据源 × 回复类型 × 意图 × 难度`，以确保两个拆分集合的分布完全一致，且训练集与测试集之间无任何提示词污染（prompt contamination）。 --- ## 数据来源 | 数据源 | 样本数量 | 回复类型 | 覆盖领域 | 意图数量 | |--------|---------|----------------|---------|---------| | `synthetic_api_generated` | 3,933 | 文本、工具调用、混合 | 12个 | 19种 | | `asos_ecom_dataset` | 2,000 | 仅文本 | 时尚领域 | 相似度搜索 | | `bitext_customer_support` | 5,000 | 工具调用、混合 | 通用领域 | 6种 | | `bitext_retail_ecom` | 4,998 | 文本、工具调用 | 通用领域 | 多种 | | `amazon_reviews_2023_*` | 23,100 | 仅文本 | 16个领域 | 4种 | | `amazon_meta_2023_*` | 5,000 | 仅文本 | 9个领域 | 4种 | --- ## 数据 Schema | 字段名 | 数据类型 | 字段说明 | |-------|------|-------------| | `id` | 字符串（string） | 唯一记录ID（示例：`ecomm_a1b2c3`） | | `source` | 字符串 | 原始数据集/处理流水线来源 | | `group` | 字符串 | 回复分组：`A`（工具调用）、`B`（文本回复）、`C`（混合回复） | | `difficulty` | 整数（int） | 任务难度：`1`（简单）至`3`（困难） | | `system` | 字符串 | 分配给助手的系统提示词 | | `history` | JSON格式字符串 | 历史对话轮次，格式为`[{"role": ..., "content": ...}]` | | `prompt` | 字符串 | 当前用户输入的消息 | | `context` | JSON格式字符串 | 检索到的文档、用户个人资料、购物车/订单状态 | | `tools` | JSON格式字符串 | 可用工具/函数定义 | | `response_type` | 字符串 | 回复类型：`text`（文本）、`tool_call`（工具调用）或`mixed`（混合） | | `response` | 字符串 | 助手的真实标注回复 | | `language` | 字符串 | ISO语言代码（示例：`en`） | | `locale` | 字符串 | 区域设置（示例：`en-US`） | | `annotator` | 字符串 | 注释来源（示例：`gemini_synthetic`、`bitext`、`amazon_user`） | | `quality_score` | 浮点数（float） | 注释质量评分（范围0至1） | | `domain` | 字符串 | 产品领域（示例：`electronics`（电子产品）、`fashion`（时尚）、`grocery_food`（食品杂货）） | | `intent_category` | 字符串 | 高级意图类别（示例：`product_discovery`（商品发现）、`order_management`（订单管理）） | | `intent` | 字符串 | 细粒度意图（共19种，示例：`order_status`（订单状态查询）、`return_refund`（退款退货）） | | `sub_intent` | 字符串 | 进一步细分的子意图（示例：`track_delivery`（物流追踪）、`refund_timeline`（退款进度查询）） | | `capability` | 字符串 | 模型能力标签（适用于相关场景） | | `test_tier` | 字符串 | 评估层级标签（适用于相关场景） | --- ## 意图分类本数据集涵盖7个高级类别下的19种细粒度意图： | 高级意图类别 | 细粒度意图 | |----------|---------| | 商品发现（Product Discovery） | `product_search`（商品搜索）、`product_detail_qa`（商品详情问答）、`product_comparison`（商品对比）、`similarity_search`（相似度搜索）、`bundle_suggestions`（套餐推荐）、`gift_recommendation`（礼品推荐）、`personalized_recommendations`（个性化推荐） | | 订单管理（Order Management） | `order_status`（订单状态查询）、`order_cancellation`（订单取消）、`reorder_assistance`（重新下单协助） | | 退换货与换货（Returns & Exchanges） | `return_refund`（退款退货）、`exchange_request`（换货申请） | | 购物车与结账（Cart & Checkout） | `cart_management`（购物车管理）、`payment_issues`（支付问题） | | 客户支持（Customer Support） | `complaint_handling`（投诉处理）、`human_handoff`（转人工客服）、`faq_answering`（常见问题解答） | | 账户管理（Account） | `account_management`（账户管理） | | 库存查询（Inventory） | `stock_availability`（库存可用性查询） | --- ## 产品覆盖领域 `appliances`（家电）、`beauty`（美妆）、`books_media`（图书音像）、`electronics`（电子产品）、`fashion`（时尚服饰）、`gaming`（游戏）、`garden_outdoor`（园艺户外）、`grocery_food`（食品杂货）、`home_kitchen`（家居厨房）、`industrial`（工业用品）、`pet_supplies`（宠物用品）、`sports_outdoors`（运动户外）、`automotive`（汽车用品）、`baby`（母婴）、`health`（健康保健）、`office`（办公用品）、`toys_games`（玩具游戏） --- ## 使用方法 python from datasets import load_dataset ds = load_dataset("V1rtucious/ecom-chatbot-train-data") train = ds["train"] test = ds["test"] # 按回复类型筛选 tool_call_examples = train.filter(lambda x: x["response_type"] == "tool_call") # 按意图筛选 order_queries = train.filter(lambda x: x["intent"] == "order_status") --- ## 拆分方法说明本次拆分采用**基于提示词组的分层抽样**方法，以确保无提示词污染、样本方差最大化与偏差最小化： - **分层维度**：`数据源 | 回复类型 | 意图 × 难度` - **拆分单元**：唯一的`(数据源, 提示词)`组 — 所有共享同一提示词的记录会被整体分配至同一个拆分集合 - 44,031条样本共包含40,949个提示词组；其中3,082条样本与至少一条其他样本共享同一提示词 - **稀有分层降级策略**：对于样本量小于5个组的稀有分层，依次移除`难度`维度，最终仅保留`数据源`维度 - 共113个唯一分层 | **随机种子**：42（可复现） - 拆分集合间的提示词污染率为0（拆分后已验证）训练集与测试集在所有关键列上的分布漂移率均小于0.35%。 --- ## 许可证本数据集采用**MIT许可证**进行发布。各原始数据源可能附带其原始提供方（Amazon（亚马逊）、ASOS、Bitext）的额外条款。

提供机构：

rescommons

5,000+

优质数据集

54 个

任务类型

进入经典数据集