Cookies Dataset DAIS 2024 🍪 🍪 🍪
Databricks, indexed 2024-09-17
Download link:
https://marketplace.databricks.com/details/f8498740-31ea-49f8-9206-1bbf533f3993/Databricks_Cookies-Dataset-DAIS-2024-🍪-🍪-🍪
**Overview: Cookies Dataset, Data + AI Summit 2024**
Welcome to the Bakehouse cookie 🍪 🍪 🍪 dataset created for the Data + AI Summit 2024 keynotes. This synthetic dataset can be used for any Databricks workload and helps you recreate the demos presented at DAIS.
*Note* (June 3, 2025): Resolved data inconsistencies in the customerID and transactionID fields of the transactions table.
**Purpose**
The dataset simulates a bakery franchise business, providing a practical example for:
* Building data pipelines with Delta Live Tables
* Performing analytics with Databricks SQL
* Exploring AI and machine learning capabilities
**List of Ingredients 🍪**
The data is provided in the *bakehouse* catalog.
**bakehouse.sales** contains all sales-related tables:
* **bakehouse.sales.franchises**: This table is sourced directly from Salesforce using LakeFlow Connectors. It contains key information about the existing franchises, including performance metrics, location details, and market penetration data.
* **bakehouse.sales.transactions**: This table contains online sales data extracted from SQL Server using LakeFlow Connectors. It offers insights into customer purchasing patterns, popular products, and revenue trends across regions.
* **Volume: suppliers_xml**
Stored in a Unity Catalog managed volume, this data is in XML format and is ingested using Auto Loader. It provides detailed information about our suppliers and the local ingredients they offer, which is crucial for understanding regional product availability and supply chain logistics.
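The sales tables above are addressed with three-level names of the form `catalog.schema.table`. As a minimal sketch, the snippet below builds those names and a preview statement for each table. The helper names `qualified_names` and `preview_query` are hypothetical and introduced only for illustration; the catalog name assumes the dataset was installed as `bakehouse`, as described here.

```python
# Catalog and table names are taken from the description above.
# Assumption: the dataset is installed under the catalog name "bakehouse".
CATALOG = "bakehouse"
TABLES = {
    "sales": ["franchises", "transactions"],
}


def qualified_names(catalog, tables):
    """Return fully qualified catalog.schema.table names."""
    return [
        f"{catalog}.{schema}.{table}"
        for schema, names in tables.items()
        for table in names
    ]


def preview_query(name, limit=5):
    """Build a simple preview statement for one table (hypothetical helper)."""
    return f"SELECT * FROM {name} LIMIT {limit}"


for name in qualified_names(CATALOG, TABLES):
    print(preview_query(name))
```

In a Databricks notebook, each generated statement could be pasted into the SQL editor or passed to `spark.sql(...)`.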
**bakehouse.media** contains customer reviews for sentiment analysis:
* **bakehouse.media.customer_reviews**:
The 'customer_reviews' table contains customer reviews for various bakery franchises. It includes details such as the review text, the date it was posted, and the franchise's unique identifier. Use the data to monitor customer satisfaction levels, identify customer feedback trends, and track customer opinion changes over time.
* **bakehouse.media.gold_reviews_chunked**:
The 'gold_reviews_chunked' table contains reviews from various franchises, broken down into smaller chunks. It includes details such as the review date, the chunked text, and the corresponding review URL. Use the data to analyze reviews, identify common themes, and track trends in customer feedback.
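The description does not say how the reviews in `gold_reviews_chunked` were split. As an illustration only, the sketch below shows one common approach: fixed-size word windows with a small overlap. The function `chunk_text`, its parameters, and the sample review are all assumptions made for this example and are not taken from the dataset.

```python
def chunk_text(text, max_words=20, overlap=5):
    """Split text into overlapping word-based chunks.

    This is an illustrative technique, not the method used to build
    gold_reviews_chunked (which the description does not specify).
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


# Hypothetical review text, written for this example.
sample_review = (
    "The cookies were fresh and the staff were friendly, "
    "but the queue was long and the shop ran out of my favorite flavor."
)

for i, chunk in enumerate(chunk_text(sample_review, max_words=10, overlap=2)):
    print(i, chunk)
```

The overlap keeps some shared context between neighboring chunks, which can help when chunks are later searched or embedded independently.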
**Use Case: Data Engineering**
* Watch the new [Databricks Workflows demo recording](https://databricks.com/resources/demos/videos/data-engineering/databricks-workflows) to get an overview of the Data Intelligence Platform and this dataset. The demo covers Delta Live Tables, DB SQL and Notebook tasks, AI functions and Lakeview dashboards.
* Explore the [DLT Getting Started Tutorial](https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables), a step-by-step guide that uses a different DLT example.
* The attached DLT notebook contains the SQL pipeline used in the example above. Once you have completed the Getting Started guide, run your own DLT 🍪 pipeline in SQL.
* The new LakeFlow platform will unify DLT and Workflows into a single canvas without any migration effort for existing data engineering workloads. Get the first [glimpse of LakeFlow here](https://www.databricks.com/resources/demos/videos/lakeflow-demo?itm_data=demo_center).
**Use Case: DB SQL**
* Watch the DAIS keynote where this [dataset is used](https://youtu.be/UcdFPRT_sG8?list=PLTPXxbhUt-YXh57uL3QUQiKLwTf39IoT1&t=959).
* Take a challenge: open the SQL editor and answer, "How many Pearly Pies were sold in total?"
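The shape of a query for the challenge above can be tried locally before opening the SQL editor. The sketch below uses Python's built-in `sqlite3` with a tiny in-memory stand-in table. The column names `product` and `quantity` and all sample rows are assumptions made for this example; check the real schema of `bakehouse.sales.transactions` (for instance with `DESCRIBE TABLE`) before adapting it.

```python
import sqlite3

# Hypothetical stand-in rows; NOT data from the real dataset.
# Assumed columns: transactionID, product, quantity.
rows = [
    (1, "Pearly Pies", 3),
    (2, "Some Other Cookie", 2),
    (3, "Pearly Pies", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (transactionID INTEGER, product TEXT, quantity INTEGER)"
)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

# Sum the units sold for a single product.
QUERY = """
SELECT SUM(quantity) AS total_sold
FROM transactions
WHERE product = 'Pearly Pies'
"""

total = conn.execute(QUERY).fetchone()[0]
print(total)  # 8 for the stand-in rows above
conn.close()
```

On Databricks, the same aggregate would run against `bakehouse.sales.transactions` instead of the local stand-in table, provided the assumed column names match the actual ones.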
**Use Case: AI**
* Watch the famous [DAIS keynote](https://youtu.be/acIBdCOj88E?list=PLTPXxbhUt-YXh57uL3QUQiKLwTf39IoT1&t=1204), in which Kasey demonstrated compound AI using the cookie dataset.
**Use Case: AI/BI**
* Use this dataset to explore natural language queries with Genie.
Provider: Databricks
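**Summary**
This synthetic dataset was created for Data + AI Summit 2024 and simulates a bakery franchise business for demonstrating data engineering, SQL analytics, and AI applications. It contains sales data (such as franchise information and transaction records) and media data (such as customer reviews), supports Delta Live Tables, Databricks SQL, and machine learning workloads, and has global coverage.
The summary above was collected and generated by 遇见数据集.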



