azizstark/synthetic-chargeback-cases

Name: azizstark/synthetic-chargeback-cases
Creator: azizstark
Published: 2026-04-20 20:42:55
License: MIT

Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.

Download link:

https://hf-mirror.com/datasets/azizstark/synthetic-chargeback-cases

Description:

---
license: mit
task_categories:
- tabular-classification
language:
- en
tags:
- chargeback
- fraud-detection
- fintech
- payments
- credit-card
- xgboost
- synthetic
- representment
- dispute-resolution
pretty_name: Synthetic Chargeback Cases for Representment Win Prediction
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: chargeback_cases_train.csv
  - split: validation
    path: chargeback_cases_val.csv
  - split: test
    path: chargeback_cases_test.csv
- config_name: full
  data_files:
  - split: train
    path: chargeback_cases_full.csv
- config_name: reason_codes
  data_files:
  - split: train
    path: reason_codes_reference.csv
---

# Synthetic Chargeback Cases for Representment Win Prediction

## Dataset Description

A synthetic dataset of **10,000 credit card chargeback cases** designed for training and evaluating machine learning models that predict **representment win probability**: the likelihood that a merchant wins if they fight a chargeback dispute.

The dataset models realistic chargeback workflows, including evidence collection, reason code categorization, customer/merchant profiling, and outcome determination. It is suitable for building AI-assisted analyst tools that summarize cases, assemble evidence, and provide policy-aligned recommendations (REPRESENT / REFUND / ESCALATE).

### Supported Tasks

- **Binary Classification**: Predict whether a merchant will win (`representment_won = 1`) or lose (`representment_won = 0`) a chargeback representment.
- **Probability Estimation**: Estimate calibrated win probability for prioritizing cases.
- **Explainability Research**: Analyze which evidence types and case features drive outcomes using SHAP or similar methods.

### Languages

English (all text fields are in English).

## Dataset Summary

| Property | Value |
|---|---|
| Total cases | 10,000 |
| Represented cases (with outcome) | 7,804 |
| Accepted cases (no outcome) | 2,196 |
| Date range | Jan 2024 – Dec 2025 |
| Card networks | Visa (65.6%), Mastercard (34.4%) |
| Reason codes | 15 (Visa + Mastercard) |
| Chargeback categories | 4 (Fraud, Consumer Dispute, Authorization, Processing Error) |
| Total columns | 65 |
| Overall win rate (represented) | ~69.7% |
| Transaction amount range | $10.00 – $6,000.00 |
| Currencies | USD (82.6%), EUR (9.3%), GBP (4.8%), CAD (3.3%) |

## Dataset Structure

### Data Splits

The dataset is split **by time** (not randomly) to simulate real-world deployment, where models predict future cases from past data:

| Split | File | Rows | Time Period | Purpose |
|---|---|---|---|---|
| `train` | `chargeback_cases_train.csv` | 4,682 | Jan 2024 – ~Jul 2025 | Model training |
| `validation` | `chargeback_cases_val.csv` | 1,561 | ~Aug 2025 – ~Oct 2025 | Hyperparameter tuning |
| `test` | `chargeback_cases_test.csv` | 1,561 | ~Nov 2025 – Dec 2025 | Final evaluation |
| `full` | `chargeback_cases_full.csv` | 10,000 | Jan 2024 – Dec 2025 | All cases (including non-represented) |

> **Note:** The train/val/test splits contain only **represented** cases (where `case_disposition == "REPRESENTED"`) and have a binary `representment_won` label. The full dataset also includes **accepted** cases (where the merchant did not fight the chargeback), which have `NaN` for `representment_won`.

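
The time-based split described above can be sketched in plain Python. This is a minimal illustration, not the author's `generate_dataset.py`: the function name `time_based_split`, the integer cut-point rule, and the toy records are assumptions. The sketch keeps only represented cases, sorts them by `transaction_date`, and cuts at 60% and 80%; applied to 7,804 represented cases, those cut points give the row counts listed in the splits table.

```python
from datetime import date, timedelta


def time_based_split(cases, train_pct=60, val_pct=20):
    """Split represented cases chronologically into train/val/test lists."""
    # Accepted cases have no outcome label, so they are excluded first.
    represented = [c for c in cases if c["case_disposition"] == "REPRESENTED"]
    represented.sort(key=lambda c: c["transaction_date"])
    n = len(represented)
    # Integer cut points (an assumed rounding rule, not confirmed by the card).
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return represented[:train_end], represented[train_end:val_end], represented[val_end:]


# Toy stand-in for the full file: 7,804 represented and 2,196 accepted cases.
start = date(2024, 1, 1)
toy_cases = [
    {
        "case_id": i,
        "transaction_date": start + timedelta(days=i % 730),
        "case_disposition": "REPRESENTED" if i < 7804 else "ACCEPTED",
    }
    for i in range(10_000)
]

train, val, test = time_based_split(toy_cases)
print(len(train), len(val), len(test))  # prints: 4682 1561 1561
```

Because the cut is chronological, every training case precedes (or shares a date with) every validation case, which is what prevents leakage from future cases into training.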
### Additional Files

| File | Description |
|---|---|
| `reason_codes_reference.csv` | Lookup table for all 15 Visa/Mastercard reason codes with categories, descriptions, base win rates, and required evidence |

### Data Fields

The dataset contains **65 columns** organized into the following groups:

#### Identifiers & Dates (5 columns)

| Column | Type | Description |
|---|---|---|
| `case_id` | string | Unique case identifier (e.g., `CB-2024-00001`) |
| `transaction_id` | string | Unique transaction identifier (e.g., `TXN-550639328`) |
| `transaction_date` | date | Date of the original transaction |
| `dispute_date` | date | Date the chargeback was filed |
| `resolution_date` | date | Date the case was resolved |

#### Transaction Features (10 columns)

| Column | Type | Description |
|---|---|---|
| `transaction_amount` | float | Dollar amount of the disputed transaction ($10–$6,000) |
| `amount_log` | float | Log-transformed amount (`log1p(amount)`) |
| `amount_vs_merchant_avg` | float | Ratio of transaction amount to merchant's average ticket |
| `currency` | string | Transaction currency (USD, EUR, GBP, CAD) |
| `payment_method` | string | Payment type (credit, debit, prepaid) |
| `is_recurring` | binary | 1 if subscription/recurring billing |
| `is_card_present` | binary | 1 if physical card was used (card-present transaction) |
| `is_cross_border` | binary | 1 if transaction crossed country borders |
| `is_digital_goods` | binary | 1 if product is digital (SaaS, streaming, downloads) |
| `card_network` | string | Visa or Mastercard |

#### Reason Code Features (5 columns)

| Column | Type | Description |
|---|---|---|
| `reason_code` | string | Network-specific reason code (e.g., `Visa_10.4`, `MC_4837`) |
| `reason_category` | string | High-level category: FRAUD, CONSUMER_DISPUTE, AUTHORIZATION, PROCESSING_ERROR |
| `reason_description` | string | Human-readable description of the reason code |
| `reason_code_encoded` | int | Integer-encoded reason code (0–14) |
| `category_encoded` | int | Integer-encoded category (0–3) |

#### Customer Features (9 columns)

| Column | Type | Description |
|---|---|---|
| `customer_id` | string | Anonymized customer identifier |
| `customer_tenure_months` | int | Months the customer has been with the bank |
| `customer_total_orders` | int | Total historical orders with the merchant |
| `customer_prior_disputes` | int | Number of previous chargebacks filed |
| `customer_prior_win_rate` | float | Win rate on past disputes (0–1) |
| `customer_risk_score` | float | Composite risk score (0–1, higher = riskier) |
| `customer_account_verified` | binary | 1 if KYC-verified account |
| `customer_email_domain` | string | Email provider domain |
| `is_repeat_disputer` | binary | 1 if customer has filed 2+ previous disputes |

#### Merchant Features (8 columns)

| Column | Type | Description |
|---|---|---|
| `merchant_id` | string | Anonymized merchant identifier |
| `merchant_name` | string | Synthetic merchant name |
| `merchant_category_code` | int | MCC code (e.g., 5732 = Electronics) |
| `merchant_category_name` | string | Human-readable MCC name |
| `merchant_monthly_volume` | int | Monthly transaction volume |
| `merchant_chargeback_rate` | float | Chargeback-to-transaction ratio (%) |
| `merchant_avg_ticket` | float | Average transaction amount for this merchant |
| `merchant_years_active` | int | Years the merchant has been processing payments |

#### Evidence Features (16 columns)

| Column | Type | Description |
|---|---|---|
| `has_delivery_tracking` | binary | 1 if shipment tracking number on file |
| `has_delivery_confirmation` | binary | 1 if delivery confirmed by carrier |
| `has_delivery_signature` | binary | 1 if signed proof of delivery exists |
| `has_avs_match` | binary | 1 if billing address matched (Address Verification System) |
| `has_cvv_match` | binary | 1 if CVV security code matched at authorization |
| `has_3ds_authentication` | binary | 1 if 3D Secure authentication was completed |
| `has_customer_communication` | binary | 1 if emails/chat with customer are on file |
| `has_refund_policy_shown` | binary | 1 if refund policy was displayed at checkout |
| `has_terms_accepted` | binary | 1 if customer accepted Terms & Conditions |
| `has_ip_geomatch` | binary | 1 if customer IP location matches billing address |
| `has_device_fingerprint` | binary | 1 if device recognized from previous sessions |
| `has_usage_proof` | binary | 1 if login/access logs show product was used |
| `has_prior_non_disputed_txn` | binary | 1 if customer had previous successful orders |
| `evidence_count` | int | Total count of evidence items on file |
| `evidence_completeness` | float | Fraction of required evidence items present (0–1) |
| `evidence_strength_score` | float | Weighted composite evidence score (0–1) |

#### Timing Features (5 columns)

| Column | Type | Description |
|---|---|---|
| `days_to_dispute` | int | Days between transaction and chargeback filing |
| `days_to_respond` | int | Days the merchant took to respond |
| `deadline_days` | int | Total days allowed by network (Visa=30, MC=45) |
| `response_within_deadline` | binary | 1 if merchant responded within the deadline |
| `is_holiday_period` | binary | 1 if transaction occurred during holiday season (Nov–Jan 7) |

#### Derived Features (2 columns)

| Column | Type | Description |
|---|---|---|
| `dispute_velocity` | float | Rate of disputes per month of tenure |
| `mcc_encoded` | int | Integer-encoded merchant category code |

#### Target & Outcome (3 columns)

| Column | Type | Description |
|---|---|---|
| `win_probability_true` | float | Ground-truth win probability used to generate the binary outcome |
| `case_disposition` | string | `REPRESENTED` (merchant fought) or `ACCEPTED` (merchant accepted loss) |
| `representment_won` | binary | **Target variable**: 1 if merchant won the representment, 0 if lost, NaN if not represented |

## Reason Codes

The dataset covers 15 real Visa and Mastercard reason codes across 4 categories:

| Category | Code | Description | Base Win Rate | Frequency |
|---|---|---|---|---|
| **FRAUD** | Visa_10.4 | Fraud - Card Not Present | 32% | 25.4% |
| **FRAUD** | Visa_10.5 | Fraud - Counterfeit Transaction | 18% | 3.1% |
| **FRAUD** | MC_4837 | No Cardholder Authorization | 28% | 8.3% |
| **FRAUD** | MC_4863 | Cardholder Does Not Recognize | 38% | 3.9% |
| **CONSUMER_DISPUTE** | Visa_13.1 | Merchandise/Services Not Received | 55% | 14.6% |
| **CONSUMER_DISPUTE** | Visa_13.3 | Not as Described / Defective | 42% | 7.9% |
| **CONSUMER_DISPUTE** | MC_4853 | Goods/Services Not as Described | 40% | 7.4% |
| **CONSUMER_DISPUTE** | MC_4855 | Goods/Services Not Received | 56% | 4.9% |
| **AUTHORIZATION** | Visa_11.1 | Card Recovery Bulletin | 22% | 2.8% |
| **AUTHORIZATION** | Visa_11.2 | Declined Authorization | 52% | 4.9% |
| **AUTHORIZATION** | MC_4808 | Authorization Related | 48% | 7.1% |
| **PROCESSING_ERROR** | Visa_12.1 | Late Presentment | 28% | 2.9% |
| **PROCESSING_ERROR** | Visa_12.2 | Incorrect Transaction Code | 62% | 2.0% |
| **PROCESSING_ERROR** | Visa_12.5 | Incorrect Amount | 68% | 2.0% |
| **PROCESSING_ERROR** | MC_4834 | Duplicate Processing | 72% | 2.9% |

## Dataset Creation

### Generation Process

The dataset was generated synthetically using a Python script (`generate_dataset.py`) with the following methodology:

1. **Customer Profiles**: 3,000 unique customers with realistic tenure, order history, dispute history, and risk profiles drawn from exponential and beta distributions.
2. **Merchant Pool**: 20 merchants across 14 MCC categories with varying volumes, chargeback rates, and ticket sizes.
3. **Transaction Generation**: Each case is assigned a reason code (weighted by real-world frequency), merchant, customer, and transaction details.
4. **Evidence Simulation**: Evidence flags are generated with base rates that vary by reason category, transaction amount, digital vs. physical goods, and customer tenure, reflecting real-world correlations.
5. **Outcome Generation**: Win probability is computed from a rule-based function incorporating reason code base rates, evidence presence, customer history, amount, and timing. Binary outcomes are sampled from this probability with added noise.
6. **Time-Based Splitting**: Represented cases are sorted by transaction date and split 60/20/20 into train/val/test.

### Realistic Correlations Built In

- **3D Secure authentication** provides a large boost for fraud cases (liability shift)
- **Delivery confirmation + signature** are critical for consumer dispute cases
- **Higher transaction amounts** reduce win probability (harder to defend)
- **Cross-border transactions** carry a penalty
- **Repeat disputers** (serial filers) are easier to defend against
- **Digital goods** require usage proof instead of delivery evidence
- **A missed response deadline** results in an automatic loss

### Random Seeds

All random number generators use `seed=42` for full reproducibility.

## Intended Use

### Primary Use Cases

- **ML model training**: Train classifiers to predict chargeback representment outcomes
- **Explainability research**: Study feature importance and SHAP explanations in financial decision-making
- **AI assistant development**: Build AI-assisted analyst tools for chargeback review
- **Educational**: Learn about chargeback processes, evidence requirements, and ML applied to fintech

### Out-of-Scope Uses

- **Production fraud detection**: This is synthetic data and should not be used as a substitute for real transaction data in production fraud systems.
- **Regulatory compliance**: The dataset does not reflect actual card network rules or regulatory requirements.

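
The Outcome Generation step and the missed-deadline rule described under Dataset Creation can be illustrated with a deliberately reduced sketch. It is not the author's rule-based function: the helper names `win_probability` and `sample_outcome` are hypothetical, the strict `>` deadline comparison and the use of `random.Random` are assumptions, and the evidence, customer, amount, and noise adjustments are left out because the card does not publish their weights. Only the base win rates, the Visa/Mastercard deadlines, and the seed value come from the card.

```python
import random

# Base win rates copied from the Reason Codes table (subset for brevity).
BASE_WIN_RATE = {
    "Visa_10.4": 0.32,
    "Visa_13.1": 0.55,
    "MC_4837": 0.28,
    "MC_4834": 0.72,
}
# Response deadlines from the `deadline_days` column description.
DEADLINE_DAYS = {"Visa": 30, "Mastercard": 45}


def win_probability(reason_code, card_network, days_to_respond):
    """Return the base win rate, or 0.0 if the response deadline was missed."""
    # Assumption: responding exactly on the deadline day still counts as on time.
    if days_to_respond > DEADLINE_DAYS[card_network]:
        return 0.0
    return BASE_WIN_RATE[reason_code]


def sample_outcome(prob, rng):
    """Draw a binary outcome (1 = merchant wins) from a win probability."""
    return 1 if rng.random() < prob else 0


rng = random.Random(42)  # the card states seed=42 for all generators
p_late = win_probability("Visa_10.4", "Visa", days_to_respond=31)
p_on_time = win_probability("MC_4834", "Mastercard", days_to_respond=20)
outcomes = [sample_outcome(p_on_time, rng) for _ in range(1000)]
print(p_late, p_on_time, sum(outcomes) / len(outcomes))
```

In the real generator the probability would be shifted by evidence flags and the other factors listed above before sampling, so per-code win rates in the data need not equal the base rates.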
## Baseline Model Performance

An XGBoost classifier trained on this dataset achieves:

| Metric | Value |
|---|---|
| AUC-ROC | 0.719 |
| Accuracy | 70.6% |
| Precision | 73.2% |
| Recall | 90.9% |
| F1 Score | 0.811 |
| Brier Score | 0.188 |

The model uses 45 features (numeric, binary, and encoded categorical) with Platt scaling (sigmoid calibration) for probability calibration.

## Limitations and Biases

- **Synthetic data**: All cases are generated programmatically. While correlations are designed to be realistic, they may not capture the full complexity of real-world chargeback patterns.
- **Simplified evidence model**: Real evidence evaluation involves document quality, completeness of narratives, and network-specific rules that are not captured here.
- **Fixed merchant pool**: Only 20 merchants are simulated, which limits merchant-level diversity.
- **No temporal drift**: The data generation process is stationary, whereas real-world chargeback patterns shift over time due to policy changes, fraud trends, and seasonal effects.
- **US-centric**: Amount distributions and merchant categories are primarily modeled on US market patterns.

## Citation

If you use this dataset in your research or projects, please cite:

```bibtex
@misc{synthetic_chargeback_cases_2025,
  title={Synthetic Chargeback Cases for Representment Win Prediction},
  author={azizstark},
  year={2025},
  url={https://huggingface.co/datasets/azizstark/synthetic-chargeback-cases},
  note={Synthetic dataset of 10,000 credit card chargeback cases for representment win prediction}
}
```

## License

This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).

## Dataset Card Contact

For questions or feedback, please open an issue on the dataset's discussion tab.
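
As a quick consistency check on the baseline metrics table, the reported F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The helper name `f1_from_precision_recall` is illustrative only and is not part of the dataset's tooling.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision 73.2% and recall 90.9%, as reported for the XGBoost baseline.
f1 = f1_from_precision_recall(0.732, 0.909)
print(round(f1, 3))  # prints: 0.811
```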

Provider:

azizstark