2025 BNPL Risk & NLP Premium Dataset - SAMPLE
Source: Databricks · Indexed 2026-01-17
Download link:
https://marketplace.databricks.com/details/1ba7f7ec-d99f-4e41-9d12-1004ff600b4a/AIDC-Inc-_2025-BNPL-Risk-&-NLP-Premium-Dataset---SAMPLE
Description:
**Sample:** 100 rows
**Total Rows:** ~25,000
## **Overview**
The **2025 BNPL Risk & NLP Premium Dataset** is a high-fidelity, privacy-compliant synthetic dataset purpose-built for modern fintech risk modeling. Buy Now, Pay Later (BNPL) risk refers to the credit, fraud, and behavioral risks unique to short-term installment financing products. BNPL risk is not purely numeric: many risk signals live in language.
Natural Language Processing (NLP) enables models to extract risk-relevant signals from text sources that traditional credit models ignore or underutilize. This dataset captures **25,000 Buy Now, Pay Later transactions** with a strong focus on *thin-file borrowers*, particularly Gen Z and Millennial consumers, a segment often underrepresented in traditional credit datasets.
Unlike conventional tabular credit files, this dataset enables **multimodal AI workflows** by combining structured financial attributes (income, loan terms, FICO proxy scores) with **unstructured customer support text logs**. This design allows data teams to train, validate, and benchmark NLP-enhanced risk models, sentiment-aware default prediction systems, and next-generation underwriting pipelines, without the privacy and consent overhead that comes with real customer data.
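As a minimal sketch of what combining the structured and text columns might look like, the Python below derives two tabular features and two text features from a single row. The distress vocabulary, the `text_features` and `combine` helpers, and the example row are illustrative assumptions, not part of the dataset or the provider's method; a real pipeline would likely replace the keyword count with a trained sentiment or embedding model.

```python
import re

# Hypothetical distress vocabulary; the dataset does not define one.
DISTRESS_TERMS = ("late fee", "can't pay", "lost my job", "extension", "overdraft")


def text_features(log: str) -> dict:
    """Count distress-term occurrences and word tokens in a support log."""
    lowered = log.lower()
    hits = sum(lowered.count(term) for term in DISTRESS_TERMS)
    tokens = re.findall(r"[a-z']+", lowered)
    return {"distress_hits": hits, "log_tokens": len(tokens)}


def combine(row: dict) -> dict:
    """Merge structured attributes with text-derived features."""
    features = {
        "loan_to_income": row["Loan_Amount"] / row["Annual_Income"],
        "fico_proxy": row["FICO_Proxy_Score"],
    }
    features.update(text_features(row["Customer_Support_Log"]))
    return features


# Invented placeholder row using the listing's column names (not real data).
example_row = {
    "Loan_Amount": 600,
    "Annual_Income": 30000,
    "FICO_Proxy_Score": 640,
    "Customer_Support_Log": "I lost my job and need an extension on the late fee.",
}
features = combine(example_row)
print(features)
```

The resulting dictionary can be fed to any tabular learner, which keeps the text signal and the numeric signal in one feature vector.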
## **Provenance**
This dataset is **100% synthetic**, generated based on real-world data patterns. No real customer records were used. The synthetic generation approach eliminates personally identifiable information (PII) and removes the need for consent management or anonymization workflows, enabling safe experimentation and rapid deployment.
## **Use Cases**
This dataset enables high-value analytics and ML workflows on Databricks:
* **Multimodal Credit Risk Modeling:** Combine tabular risk features with NLP sentiment signals to improve default prediction accuracy.
* **NLP-Driven Distress Detection:** Train models to detect early financial stress using customer support language patterns.
* **Thin-File Underwriting Research:** Prototype alternative credit scoring approaches for Gen Z and underbanked consumers.
* **Model Validation & Benchmarking:** Stress-test imbalance-aware ML pipelines against a realistic 5.4% default rate.
* **Feature Engineering at Scale:** Develop and test cross-modal feature pipelines using Spark, MLflow, and Delta Lake.
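To make the class imbalance concrete, the sketch below works through the arithmetic implied by a 5.4% default rate. It assumes exactly 25,000 rows (the listing says approximately 25,000) and uses the common "balanced" weighting heuristic, `n_samples / (n_classes * class_count)`; the function name is hypothetical and the listing does not prescribe any particular weighting scheme.

```python
def balanced_class_weights(n_total: int, default_rate: float) -> dict:
    """Return per-class weights using n_total / (2 * class_count)."""
    n_pos = round(n_total * default_rate)  # defaults (Default_Flag == 1)
    n_neg = n_total - n_pos                # non-defaults (Default_Flag == 0)
    return {
        0: n_total / (2 * n_neg),
        1: n_total / (2 * n_pos),
    }


# Assumed row count of 25,000; the listing gives only an approximate total.
weights = balanced_class_weights(25_000, 0.054)
print(weights)  # class 1 is weighted about 9.26, class 0 about 0.53
```

Under these assumptions there are about 1,350 defaults against 23,650 non-defaults, so a model that always predicts "no default" already scores 94.6% accuracy. This is why metrics such as recall or precision-recall AUC tend to be more informative than accuracy here.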
## **Column Dictionary**
* **Transaction_ID** (string): Unique synthetic identifier for each BNPL transaction
* **User_ID** (string): Unique synthetic customer identifier
* **Transaction_Date** (string): Date of the BNPL transaction
* **Customer_Age** (integer): Age of the customer at transaction time
* **Customer_State** (string): Customer’s U.S. state of residence
* **Job_Title** (string): Synthetic occupation descriptor
* **Annual_Income** (integer): Geo-adjusted annual income estimate
* **FICO_Proxy_Score** (integer): Synthetic proxy for credit score
* **Merchant_Name** (string): Merchant associated with the BNPL purchase
* **Category** (string): Purchase category
* **Loan_Amount** (integer): Principal BNPL loan amount
* **Term_Length_Months** (integer): Repayment term length
* **Interest_Rate** (decimal): Applied interest rate
* **Bureau_Inquiries_Last_12m** (integer): Credit inquiries in last 12 months
* **Default_Flag** (integer): Binary indicator of default
* **Customer_Support_Log** (string): Synthetic text interactions (emails, chat, call notes)
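The column dictionary above can be expressed as a typed mapping for validating rows on load. This is a sketch under stated assumptions: `COLUMN_TYPES` and `parse_row` are hypothetical names, the sample row holds invented placeholder values, and the listing does not specify the date format or whether `Interest_Rate` is stored as a fraction or a percentage, so dates stay as strings and the rate is parsed as a `Decimal` without rescaling.

```python
from decimal import Decimal

# Column name -> Python type, following the listing's column dictionary.
COLUMN_TYPES = {
    "Transaction_ID": str,
    "User_ID": str,
    "Transaction_Date": str,  # format not documented; kept as a string
    "Customer_Age": int,
    "Customer_State": str,
    "Job_Title": str,
    "Annual_Income": int,
    "FICO_Proxy_Score": int,
    "Merchant_Name": str,
    "Category": str,
    "Loan_Amount": int,
    "Term_Length_Months": int,
    "Interest_Rate": Decimal,  # unit (fraction vs. percent) not documented
    "Bureau_Inquiries_Last_12m": int,
    "Default_Flag": int,
    "Customer_Support_Log": str,
}


def parse_row(raw: dict) -> dict:
    """Coerce a row of raw strings to the documented column types."""
    parsed = {name: cast(raw[name]) for name, cast in COLUMN_TYPES.items()}
    if parsed["Default_Flag"] not in (0, 1):
        raise ValueError("Default_Flag must be 0 or 1")
    return parsed


# Invented placeholder row (not taken from the dataset).
raw_row = {
    "Transaction_ID": "TXN-PLACEHOLDER", "User_ID": "USR-PLACEHOLDER",
    "Transaction_Date": "2025-01-01", "Customer_Age": "24",
    "Customer_State": "CA", "Job_Title": "Barista",
    "Annual_Income": "32000", "FICO_Proxy_Score": "640",
    "Merchant_Name": "Example Merchant", "Category": "Electronics",
    "Loan_Amount": "450", "Term_Length_Months": "4",
    "Interest_Rate": "0.00", "Bureau_Inquiries_Last_12m": "1",
    "Default_Flag": "0", "Customer_Support_Log": "Placeholder support note.",
}
parsed_row = parse_row(raw_row)
```

Parsing `Interest_Rate` as `Decimal` rather than `float` follows the listing's "decimal" type and avoids binary rounding in later arithmetic.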
## **Keywords**
BNPL, credit risk, synthetic data, fintech, NLP, multimodal AI, default prediction, thin-file borrowers
Provider:
AIDC, Inc.



