Privacy-Safe Synthetic Data in Unity Catalog for Financial Services
Source: Databricks | Listed: 2025-02-14
Download link:
https://marketplace.databricks.com/details/32db0d45-89ad-4159-b15c-f031d981998a/MOSTLY-AI_Privacy-Safe-Synthetic-Data-in-Unity-Catalog-for-Financial-Services
## Overview
This Databricks Solution Accelerator enables financial institutions to generate high-quality, privacy-preserving synthetic data using the **open-source Synthetic Data SDK from MOSTLY AI**—all within **Databricks**.
By using **Unity Catalog** for secure data management and governance, this accelerator lets financial services teams work on **AI/ML training, analytics, fraud detection, and risk modeling without exposing sensitive customer information**.
---
## 📌 How to Get Started
To use this accelerator, **click "Get Instant Access"** in the top right.
- The **included notebooks** will appear in the **"Other assets"** section of the **Delta Share**, where you can **clone them into your own workspace**.
- The **financial data tables** will also be accessible in **Unity Catalog** via the Delta Share, allowing you to begin generating synthetic data immediately.
---
## 🔹 Use Cases
Financial institutions can leverage this accelerator for multiple synthetic data applications, including:
✔ **AI/ML Model Training** – Use privacy-preserving synthetic data to train **fraud detection, credit risk, and predictive analytics models**.
✔ **Self-Service Data Access** – Enable analysts and data scientists to work with **high-quality data while keeping real customer information protected**.
✔ **Regulatory Compliance & Data Sharing** – Share synthetic versions of **financial data across teams, partners, and regulatory bodies** without exposing real customer records.
✔ **Software Testing & Application Development** – Use synthetic data for **developing and testing banking applications** without production data risks.
---
## 🔹 Product Details
This accelerator provides a **streamlined, end-to-end synthetic data pipeline** for **financial services**. It consists of **two fully parameterized, widget-driven notebooks** that allow users to:
📌 **Step 1**: **Train a synthetic data generator** on their financial data
📌 **Step 2**: **Generate synthetic datasets** that retain key relationships
📌 **All results are securely stored in Unity Catalog**
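
As a rough illustration of what "parameterized, widget-driven" can mean in practice, the sketch below turns three parameter values into a three-level Unity Catalog table name (`catalog.schema.table`). The `widgets` dict, its values, and both helper functions are hypothetical stand-ins; in a Databricks notebook the values would typically be read with `dbutils.widgets.get(...)`, and the accelerator's actual widget names are not stated here.

```python
import re

# Hypothetical widget values; in a Databricks notebook these would typically
# come from dbutils.widgets.get("<name>") rather than a plain dict.
widgets = {"catalog": "main", "schema": "finance", "table": "accounts"}

_SIMPLE_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier unless it is a plain SQL name."""
    if _SIMPLE_IDENT.fullmatch(name):
        return name
    return "`" + name.replace("`", "``") + "`"


def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level Unity Catalog name: catalog.schema.table."""
    parts = (catalog, schema, table)
    if not all(parts):
        raise ValueError("catalog, schema, and table must all be non-empty")
    return ".".join(quote_ident(p) for p in parts)


source_table = qualified_table_name(
    widgets["catalog"], widgets["schema"], widgets["table"]
)
print(source_table)
```

Resolving names in one place keeps the rest of a notebook free of hard-coded catalog and schema names, which is what makes it reusable across workspaces.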
---
## 📂 Included Financial Data: Accounts & Transactions
This accelerator includes **preloaded financial data in Delta Share**, ensuring users can **immediately** train and generate synthetic data **without needing external datasets**.
The included tables are derived from the **Berka dataset**, a well-known **public banking dataset** that has been **preprocessed for seamless integration into Unity Catalog**:
🔹 **Accounts** – Represents **customer bank accounts**, including **creation dates** and **assigned districts**.
🔹 **Transactions** – Logs **deposits, withdrawals, and transfers** across accounts.
This structure mirrors **real-world banking data**, making it highly relevant for financial institutions needing synthetic data for:
✔ **AI/ML model training**
✔ **Data democratization**
✔ **Regulatory compliance**
💡 **Performance Tip:**
This dataset includes **over 1 million transactions**.
To ensure **quick execution while maintaining model performance**, we recommend setting the **Max Training Time** to **10 minutes**.
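
A minimal sketch of how the 10-minute limit might be expressed in a generator configuration. The field names (`tables`, `primary_key`, `foreign_keys`, `tabular_model_configuration`, `max_training_time`) follow the Synthetic Data SDK's documented configuration format as far as we know, but they should be checked against the installed SDK version. The column name `account_id`, the generator name, and the helper function are assumptions for illustration, not taken from the accelerator's notebooks.

```python
def build_generator_config(accounts_df, transactions_df, max_training_minutes=10):
    """Assemble a two-table generator config as a plain dict.

    Assumption: key names mirror the SDK's config schema; `account_id`
    is an assumed column linking transactions to accounts.
    """
    model_cfg = {"max_training_time": max_training_minutes}  # minutes
    return {
        "name": "Berka accounts and transactions",  # hypothetical name
        "tables": [
            {
                "name": "accounts",
                "data": accounts_df,
                "primary_key": "account_id",
                "tabular_model_configuration": dict(model_cfg),
            },
            {
                "name": "transactions",
                "data": transactions_df,
                # Link each transaction to its parent account.
                "foreign_keys": [
                    {
                        "column": "account_id",
                        "referenced_table": "accounts",
                        "is_context": True,
                    }
                ],
                "tabular_model_configuration": dict(model_cfg),
            },
        ],
    }


# Placeholders: in a notebook these would be DataFrames loaded from Unity Catalog.
config = build_generator_config(accounts_df=None, transactions_df=None)
print([t["name"] for t in config["tables"]])
```

Applying the same time limit to both tables caps each table's model training separately, so the total run can take longer than 10 minutes.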
---
## 🛠️ Included Notebooks
### **📌 Step 1: Train a Synthetic Data Generator**
✔ Load financial data from **Unity Catalog**
✔ Train a synthetic data generator using the **open-source Synthetic Data SDK**
✔ Export the generator **to Unity Catalog Volumes**
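
Files in Unity Catalog Volumes are addressed with paths of the form `/Volumes/<catalog>/<schema>/<volume>/<file>`. The sketch below builds such a path for an exported generator file; the catalog, schema, volume, and file names are hypothetical, and the helper is illustrative rather than part of the accelerator.

```python
from pathlib import PurePosixPath


def volume_path(catalog: str, schema: str, volume: str, filename: str) -> str:
    """Build a Unity Catalog Volumes path for a file."""
    for part in (catalog, schema, volume, filename):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return str(PurePosixPath("/Volumes") / catalog / schema / volume / filename)


# Hypothetical names; substitute the values used in your workspace.
export_target = volume_path("main", "finance", "generators", "berka-generator.zip")
print(export_target)
```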
### **📌 Step 2: Generate Synthetic Data**
✔ Use the trained generator to **produce synthetic financial datasets**
✔ Preserve **key relationships** and store the results in **Unity Catalog**
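
One simple way to confirm that key relationships survived generation is a referential-integrity check: every foreign key in the child table should match a primary key in the parent table. The sketch below runs that check on small in-memory rows. The column names and sample values are made up for illustration, and this is a generic check, not the SDK's own quality report.

```python
def orphaned_keys(parent_rows, child_rows, key):
    """Return sorted child key values that have no matching parent row."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows if row[key] not in parent_keys})


# Made-up sample rows standing in for synthetic accounts and transactions.
accounts = [
    {"account_id": 1, "district_id": 18},
    {"account_id": 2, "district_id": 5},
]
transactions = [
    {"trans_id": 10, "account_id": 1, "amount": 250.0},
    {"trans_id": 11, "account_id": 2, "amount": -40.0},
    {"trans_id": 12, "account_id": 3, "amount": 99.0},  # no such account
]

print(orphaned_keys(accounts, transactions, "account_id"))  # [3]
```

An empty result means every synthetic transaction points at an existing synthetic account.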
---
## 💡 Key Benefits for Financial Services
✅ **Privacy-Preserving Data** – Enables **AI/ML model training** without exposing real customer data, supporting compliance with **GDPR, CCPA, and other regulations**.
✅ **No-Code, User-Friendly Setup** – Fully **widget-driven**, allowing easy setup and execution **without coding experience**.
✅ **Unity Catalog Integration** – Ensures **governance, auditability, and discoverability** across teams in Databricks.
✅ **Self-Service Data Access** – Democratizes **data access** across finance teams without manual anonymization.
✅ **AI/ML Model Training & Risk Modeling** – Ideal for **fraud detection, credit risk modeling, and predictive analytics**.
✅ **Reusable Generators** – Once trained, a **synthetic data generator can be reused indefinitely**, reducing the need for re-training.
✅ **Multi-Table Support** – Automatically preserves **primary and foreign key relationships** across datasets.
---
## 🔥 Scaling Beyond Open Source
While the **open-source Synthetic Data SDK** provides powerful synthetic data generation, organizations looking to scale can unlock:
✔ **Enterprise-grade observability & governance** – Gain **centralized visibility into synthetic data generation, usage, and performance**.
✔ **Robust professional services & support** – Leverage **expert guidance** for **implementation, compliance, and best practices**.
📩 **Interested in scaling synthetic data across your organization?**
Contact **[hello@mostly.ai](mailto:hello@mostly.ai)** to discuss **enterprise solutions**.
Provider: MOSTLY AI



