Simulated Retail Customer Data

Name: Simulated Retail Customer Data
Creator: Databricks
License: No description provided

Databricks, indexed 2024-09-25

Download link:

https://marketplace.databricks.com/details/a82597f6-5ada-49d5-b934-d6c9dece16a1/Databricks_Simulated-Retail-Customer-Data

Resource description:

**Overview**

This collection provides a comprehensive set of synthetic retail datasets designed for hands-on training with Lakeflow Spark Declarative Pipelines and general data ingestion patterns. All data is fully fictional and formatted as `CSV` or `JSON` files to mirror common real-world retail ingestion scenarios.

This share contains two schemas with tables and volumes:

- **v01**
- **v02**

**NOTE:** All data is synthetically generated and does not represent real customers or transactions.

**v01 - Schema**

This schema contains three tables: **customers**, **sales**, and **sales_orders**, each representing core records of a fictional retail company. These datasets accompany the *Get Started with Databricks for Data Analysis* course on Databricks Academy.

**Tables**

- **customers** – Customers located in the US who purchase finished goods.
- **sales** – Individual item-level sales transactions.
- **sales_orders** – The originating purchase orders associated with each customer transaction.

**Volumes**

**source_files**

Contains the raw `CSV` files used to build the three core tables.

- `customers.csv`
- `sales.csv`
- `sales_orders.csv`

**retail-pipeline**

Simulates a streaming data landing zone with three subdirectories:

- **customers** – `JSON` files representing new, updated, or deleted customers.
- **orders** – `JSON` files representing order activity.
- **status** – `JSON` files representing order status updates.

**v02 - Schema**

This schema contains seven days of order drops for three fictional retail subsidiaries, spanning **2025-11-01 through 2025-11-07**. Each subsidiary contributes one file per day, using its own format and structure. The dataset is designed for workshops using Lakeflow Spark Declarative Pipelines (SDP) to demonstrate ingestion with Auto Loader, schema normalization, and other core SDP features. It is also suitable for general ingestion and processing from cloud storage.
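The schema normalization idea mentioned for v02 can be illustrated without Spark. The following is a minimal plain-Python sketch, not SDP or Auto Loader code. The source labels (`subsidiary_a`, `subsidiary_b`), the field names, and the sample payloads are all hypothetical and are not taken from the dataset's actual files.

```python
import csv
import io
import json

# Hypothetical mappings from each source's field names to one shared schema.
FIELD_MAPS = {
    "subsidiary_a": {"OrderID": "order_id", "Total": "amount"},
    "subsidiary_b": {"id": "order_id", "order_total": "amount"},
}


def normalize(record, source):
    """Rename a raw record's fields to the shared schema and tag its source."""
    mapping = FIELD_MAPS[source]
    row = {target: record[raw] for raw, target in mapping.items()}
    row["amount"] = float(row["amount"])  # CSV values arrive as strings
    row["source"] = source
    return row


def read_csv_orders(text, source):
    """Parse CSV text and normalize every row."""
    return [normalize(r, source) for r in csv.DictReader(io.StringIO(text))]


def read_json_orders(text, source):
    """Parse a JSON array of orders and normalize every element."""
    return [normalize(r, source) for r in json.loads(text)]


# Made-up sample payloads standing in for one daily file per subsidiary.
csv_text = "OrderID,Total\nA-1,19.99\nA-2,5.00\n"
json_text = '[{"id": "B-1", "order_total": 42.5}]'

orders = (
    read_csv_orders(csv_text, "subsidiary_a")
    + read_json_orders(json_text, "subsidiary_b")
)
for order in orders:
    print(order)
```

Keeping the per-source mapping in one table means a new subsidiary format only needs a new mapping entry, which is the same design pressure a pipeline-based normalization step has to handle.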
**Volumes**

**subsidiary_daily_orders**

This volume contains three directories, each simulating daily order drops:

- **bright_home_orders** – Daily `CSV` files with home goods order activity.
- **lumina_sports_orders** – Daily `CSV` files for outdoor and athletic gear purchases.
- **northstar_outfitters_orders** – Daily `JSON` files for camping and travel product orders.

**business_daily_events**

Simulates a unified business event stream used across subsidiaries. Each day contains a mixed set of operational events stored as a single `JSON` file. These files are intended for showcasing multiplex routing in Lakeflow Spark Declarative Pipelines and include three distinct event groups:

- Store operations
- Marketing campaign activity
- Logistics and fulfillment events

**customer_changes_daily**

Contains seven days of customer change events in `JSON` format, including new signups, profile updates, and deletions. Customers first appear on the day they are observed in real subsidiary orders, providing a CDC feed aligned with downstream SCD Type 2 processing and analytics.
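To make the relationship between a CDC feed and SCD Type 2 history concrete, here is a minimal plain-Python sketch. It is an illustration of the general technique, not the mechanism SDP uses. The event fields (`customer_id`, `op`, `event_date`, `email`) and the sample events are hypothetical, and the real files in `customer_changes_daily` may be shaped differently.

```python
def apply_cdc(events):
    """Build SCD Type 2 history rows from change events ordered by date."""
    history = []   # every version of every customer, open or closed
    current = {}   # customer_id -> index of that customer's open row
    for ev in sorted(events, key=lambda e: e["event_date"]):
        cid = ev["customer_id"]
        # Close the open version, if any, at this event's date.
        if cid in current:
            history[current.pop(cid)]["valid_to"] = ev["event_date"]
        # Inserts and updates open a new version; deletes only close one.
        if ev["op"] != "delete":
            history.append({
                "customer_id": cid,
                "email": ev["email"],
                "valid_from": ev["event_date"],
                "valid_to": None,  # None marks the current version
            })
            current[cid] = len(history) - 1
    return history


# Made-up events within the dataset's stated 2025-11-01 to 2025-11-07 window.
events = [
    {"customer_id": 1, "op": "insert", "event_date": "2025-11-01", "email": "a@example.com"},
    {"customer_id": 1, "op": "update", "event_date": "2025-11-03", "email": "a2@example.com"},
    {"customer_id": 2, "op": "insert", "event_date": "2025-11-02", "email": "b@example.com"},
    {"customer_id": 2, "op": "delete", "event_date": "2025-11-05"},
]

history = apply_cdc(events)
for row in history:
    print(row)
```

Each update closes the previous row and opens a new one, so the full history of a customer stays queryable; a delete closes the open row without opening another.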


Provider:

Databricks

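Background and challenges

Background overview

This is a synthetic dataset of simulated retail customer data. It contains two schemas, v01 and v02, which together provide core business tables such as customers, sales, and orders, along with raw files in `CSV` and `JSON` formats. The data is suited to hands-on training with Lakeflow Spark Declarative Pipelines and to general data-processing instruction. All content is fictional.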

The summary above was collected and generated by 遇见数据集.
