azizstark/synthetic-chargeback-cases
Source: Hugging Face. Updated 2026-04-20; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/azizstark/synthetic-chargeback-cases
Resource description:
---
license: mit
task_categories:
- tabular-classification
language:
- en
tags:
- chargeback
- fraud-detection
- fintech
- payments
- credit-card
- xgboost
- synthetic
- representment
- dispute-resolution
pretty_name: Synthetic Chargeback Cases for Representment Win Prediction
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: chargeback_cases_train.csv
  - split: validation
    path: chargeback_cases_val.csv
  - split: test
    path: chargeback_cases_test.csv
- config_name: full
  data_files:
  - split: train
    path: chargeback_cases_full.csv
- config_name: reason_codes
  data_files:
  - split: train
    path: reason_codes_reference.csv
---
# Synthetic Chargeback Cases for Representment Win Prediction
## Dataset Description
A synthetic dataset of **10,000 credit card chargeback cases** designed for training and evaluating machine learning models that predict **representment win probability** — the likelihood a merchant will win if they fight a chargeback dispute.
The dataset models realistic chargeback workflows, including evidence collection, reason code categorization, customer/merchant profiling, and outcome determination. It is suitable for building AI-assisted analyst tools that summarize cases, assemble evidence, and provide policy-aligned recommendations (REPRESENT / REFUND / ESCALATE).
### Supported Tasks
- **Binary Classification**: Predict whether a merchant will win (`representment_won = 1`) or lose (`representment_won = 0`) a chargeback representment.
- **Probability Estimation**: Estimate calibrated win probability for prioritizing cases.
- **Explainability Research**: Analyze which evidence types and case features drive outcomes using SHAP or similar methods.
### Languages
English (all text fields are in English).
## Dataset Summary
| Property | Value |
|---|---|
| Total cases | 10,000 |
| Represented cases (with outcome) | 7,804 |
| Accepted cases (no outcome) | 2,196 |
| Date range | Jan 2024 – Dec 2025 |
| Card networks | Visa (65.6%), Mastercard (34.4%) |
| Reason codes | 15 (Visa + Mastercard) |
| Chargeback categories | 4 (Fraud, Consumer Dispute, Authorization, Processing Error) |
| Total columns | 65 |
| Overall win rate (represented) | ~69.7% |
| Transaction amount range | $10.00 – $6,000.00 |
| Currencies | USD (82.6%), EUR (9.3%), GBP (4.8%), CAD (3.3%) |
## Dataset Structure
### Data Splits
The dataset is split **by time** (not randomly) to simulate real-world deployment where models predict future cases from past data:
| Split | File | Rows | Time Period | Purpose |
|---|---|---|---|---|
| `train` | `chargeback_cases_train.csv` | 4,682 | Jan 2024 – ~Jul 2025 | Model training |
| `validation` | `chargeback_cases_val.csv` | 1,561 | ~Aug 2025 – ~Oct 2025 | Hyperparameter tuning |
| `test` | `chargeback_cases_test.csv` | 1,561 | ~Nov 2025 – Dec 2025 | Final evaluation |
| `full` (separate config) | `chargeback_cases_full.csv` | 10,000 | Jan 2024 – Dec 2025 | All cases (including non-represented) |
> **Note:** The train/val/test splits contain only **represented** cases (where `case_disposition == "REPRESENTED"`) and have a binary `representment_won` label. The full dataset also includes **accepted** cases (where the merchant did not fight the chargeback), which have `NaN` for `representment_won`.
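
As a minimal sketch of the relationship described in the note, the snippet below filters rows shaped like the full file down to represented cases and parses the label. It uses only the standard library. The inline `SAMPLE_CSV` rows and the helper name `represented_cases` are illustrative assumptions, not part of the dataset or its tooling.

```python
import csv
import io

# Hypothetical three-row excerpt shaped like chargeback_cases_full.csv
# (only the columns needed here; values are made up for illustration).
SAMPLE_CSV = """case_id,case_disposition,representment_won
CB-2024-00001,REPRESENTED,1
CB-2024-00002,ACCEPTED,
CB-2024-00003,REPRESENTED,0
"""

def represented_cases(rows):
    """Keep represented cases and convert the label to int."""
    kept = []
    for row in rows:
        if row["case_disposition"] != "REPRESENTED":
            continue  # accepted cases carry no outcome label
        labeled = dict(row)
        # float() first so that both "1" and "1.0" parse cleanly
        labeled["representment_won"] = int(float(row["representment_won"]))
        kept.append(labeled)
    return kept

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cases = represented_cases(rows)
win_rate = sum(c["representment_won"] for c in cases) / len(cases)
print(len(cases), win_rate)
```

The same filter applied to the real `chargeback_cases_full.csv` should leave the 7,804 represented cases listed in the summary table.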
### Additional Files
| File | Description |
|---|---|
| `reason_codes_reference.csv` | Lookup table for all 15 Visa/Mastercard reason codes with categories, descriptions, base win rates, and required evidence |
### Data Fields
The dataset contains **65 columns** organized into the following groups:
#### Identifiers & Dates (5 columns)
| Column | Type | Description |
|---|---|---|
| `case_id` | string | Unique case identifier (e.g., `CB-2024-00001`) |
| `transaction_id` | string | Unique transaction identifier (e.g., `TXN-550639328`) |
| `transaction_date` | date | Date of the original transaction |
| `dispute_date` | date | Date the chargeback was filed |
| `resolution_date` | date | Date the case was resolved |
#### Transaction Features (10 columns)
| Column | Type | Description |
|---|---|---|
| `transaction_amount` | float | Dollar amount of the disputed transaction ($10–$6,000) |
| `amount_log` | float | Log-transformed amount (`log1p(amount)`) |
| `amount_vs_merchant_avg` | float | Ratio of transaction amount to merchant's average ticket |
| `currency` | string | Transaction currency (USD, EUR, GBP, CAD) |
| `payment_method` | string | Payment type (credit, debit, prepaid) |
| `is_recurring` | binary | 1 if subscription/recurring billing |
| `is_card_present` | binary | 1 if physical card was used (card-present transaction) |
| `is_cross_border` | binary | 1 if transaction crossed country borders |
| `is_digital_goods` | binary | 1 if product is digital (SaaS, streaming, downloads) |
| `card_network` | string | Visa or Mastercard |
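
The two amount-derived columns follow directly from their descriptions. The sketch below recomputes them from `transaction_amount` and `merchant_avg_ticket` (the latter is listed under Merchant Features). The function name `amount_features` is hypothetical, and the code assumes a positive average ticket.

```python
import math

def amount_features(transaction_amount, merchant_avg_ticket):
    """Return (amount_log, amount_vs_merchant_avg) for one case."""
    # amount_log is documented as log1p(amount), i.e. ln(1 + amount)
    amount_log = math.log1p(transaction_amount)
    # Ratio of this transaction to the merchant's average ticket
    # (assumes merchant_avg_ticket > 0)
    amount_vs_merchant_avg = transaction_amount / merchant_avg_ticket
    return amount_log, amount_vs_merchant_avg

log_amount, ratio = amount_features(100.0, 50.0)
print(round(log_amount, 4), ratio)
```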
#### Reason Code Features (5 columns)
| Column | Type | Description |
|---|---|---|
| `reason_code` | string | Network-specific reason code (e.g., `Visa_10.4`, `MC_4837`) |
| `reason_category` | string | High-level category: FRAUD, CONSUMER_DISPUTE, AUTHORIZATION, PROCESSING_ERROR |
| `reason_description` | string | Human-readable description of the reason code |
| `reason_code_encoded` | int | Integer-encoded reason code (0–14) |
| `category_encoded` | int | Integer-encoded category (0–3) |
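
Judging by the examples `Visa_10.4` and `MC_4837`, the `reason_code` string appears to carry a network prefix followed by the network's own code. A small sketch for splitting the two, assuming that prefix convention holds for all 15 codes (the helper `split_reason_code` is hypothetical):

```python
# Assumed mapping from the prefix in `reason_code` to `card_network` values.
NETWORK_PREFIXES = {"Visa": "Visa", "MC": "Mastercard"}

def split_reason_code(reason_code):
    """Split e.g. 'MC_4837' into ('Mastercard', '4837')."""
    prefix, code = reason_code.split("_", 1)
    return NETWORK_PREFIXES[prefix], code

print(split_reason_code("Visa_10.4"))
print(split_reason_code("MC_4837"))
```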
#### Customer Features (9 columns)
| Column | Type | Description |
|---|---|---|
| `customer_id` | string | Anonymized customer identifier |
| `customer_tenure_months` | int | Months the customer has been with the bank |
| `customer_total_orders` | int | Total historical orders with the merchant |
| `customer_prior_disputes` | int | Number of previous chargebacks filed |
| `customer_prior_win_rate` | float | Win rate on past disputes (0–1) |
| `customer_risk_score` | float | Composite risk score (0–1, higher = riskier) |
| `customer_account_verified` | binary | 1 if KYC-verified account |
| `customer_email_domain` | string | Email provider domain |
| `is_repeat_disputer` | binary | 1 if customer has filed 2+ previous disputes |
#### Merchant Features (8 columns)
| Column | Type | Description |
|---|---|---|
| `merchant_id` | string | Anonymized merchant identifier |
| `merchant_name` | string | Synthetic merchant name |
| `merchant_category_code` | int | Merchant category code (MCC), e.g., 5732 = Electronics |
| `merchant_category_name` | string | Human-readable MCC name |
| `merchant_monthly_volume` | int | Monthly transaction volume |
| `merchant_chargeback_rate` | float | Chargeback-to-transaction ratio (%) |
| `merchant_avg_ticket` | float | Average transaction amount for this merchant |
| `merchant_years_active` | int | Years the merchant has been processing payments |
#### Evidence Features (16 columns)
| Column | Type | Description |
|---|---|---|
| `has_delivery_tracking` | binary | 1 if shipment tracking number on file |
| `has_delivery_confirmation` | binary | 1 if delivery confirmed by carrier |
| `has_delivery_signature` | binary | 1 if signed proof of delivery exists |
| `has_avs_match` | binary | 1 if billing address matched (Address Verification System) |
| `has_cvv_match` | binary | 1 if CVV security code matched at authorization |
| `has_3ds_authentication` | binary | 1 if 3D Secure authentication was completed |
| `has_customer_communication` | binary | 1 if emails/chat with customer are on file |
| `has_refund_policy_shown` | binary | 1 if refund policy was displayed at checkout |
| `has_terms_accepted` | binary | 1 if customer accepted Terms & Conditions |
| `has_ip_geomatch` | binary | 1 if customer IP location matches billing address |
| `has_device_fingerprint` | binary | 1 if device recognized from previous sessions |
| `has_usage_proof` | binary | 1 if login/access logs show product was used |
| `has_prior_non_disputed_txn` | binary | 1 if customer had previous successful orders |
| `evidence_count` | int | Total count of evidence items on file |
| `evidence_completeness` | float | Fraction of required evidence items present (0–1) |
| `evidence_strength_score` | float | Weighted composite evidence score (0–1) |
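
The aggregate evidence columns can be illustrated from the 13 binary flags above. This is a sketch under two assumptions that the card does not confirm: that `evidence_count` is the plain sum of those flags, and that `evidence_completeness` is the share of a reason code's required flags that are set. The `required_flags` argument is hypothetical; the actual required evidence per code lives in `reason_codes_reference.csv`.

```python
EVIDENCE_FLAGS = [
    "has_delivery_tracking", "has_delivery_confirmation",
    "has_delivery_signature", "has_avs_match", "has_cvv_match",
    "has_3ds_authentication", "has_customer_communication",
    "has_refund_policy_shown", "has_terms_accepted", "has_ip_geomatch",
    "has_device_fingerprint", "has_usage_proof",
    "has_prior_non_disputed_txn",
]

def evidence_count(case):
    """Count evidence flags set to 1 (missing flags count as 0)."""
    return sum(int(case.get(flag, 0)) for flag in EVIDENCE_FLAGS)

def evidence_completeness(case, required_flags):
    """Fraction of the required flags that are present, in [0, 1]."""
    if not required_flags:
        return 1.0  # assumption: nothing required means fully complete
    present = sum(int(case.get(flag, 0)) for flag in required_flags)
    return present / len(required_flags)

# Made-up case and requirement list, for illustration only.
case = {"has_avs_match": 1, "has_cvv_match": 1,
        "has_3ds_authentication": 0, "has_ip_geomatch": 1}
required = ["has_avs_match", "has_cvv_match",
            "has_3ds_authentication", "has_device_fingerprint"]
print(evidence_count(case), evidence_completeness(case, required))
```

The weighting behind `evidence_strength_score` is not documented here, so it is left out of the sketch.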
#### Timing Features (5 columns)
| Column | Type | Description |
|---|---|---|
| `days_to_dispute` | int | Days between transaction and chargeback filing |
| `days_to_respond` | int | Days the merchant took to respond |
| `deadline_days` | int | Total days allowed by network (Visa=30, MC=45) |
| `response_within_deadline` | binary | 1 if merchant responded within the deadline |
| `is_holiday_period` | binary | 1 if transaction occurred during holiday season (Nov–Jan 7) |
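
A rough sketch of how the timing columns relate to one another, using the documented network deadlines (Visa=30, MC=45). Two details are assumptions: a response on the deadline day counts as within the deadline, and the holiday window runs from November 1 through January 7 inclusive. The function names are hypothetical.

```python
from datetime import date

# Response windows in days, as listed for `deadline_days`.
NETWORK_DEADLINE_DAYS = {"Visa": 30, "Mastercard": 45}

def is_holiday_period(d):
    """True for Nov 1 through Jan 7 (inclusive bounds are an assumption)."""
    return d.month in (11, 12) or (d.month == 1 and d.day <= 7)

def timing_features(card_network, transaction_date, dispute_date,
                    days_to_respond):
    """Recompute the timing columns for a single case."""
    deadline_days = NETWORK_DEADLINE_DAYS[card_network]
    return {
        "days_to_dispute": (dispute_date - transaction_date).days,
        "days_to_respond": days_to_respond,
        "deadline_days": deadline_days,
        # Assumption: responding exactly on the deadline still counts.
        "response_within_deadline": int(days_to_respond <= deadline_days),
        "is_holiday_period": int(is_holiday_period(transaction_date)),
    }

features = timing_features("Visa", date(2024, 12, 20), date(2025, 1, 10), 31)
print(features)
```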
#### Derived Features (2 columns)
| Column | Type | Description |
|---|---|---|
| `dispute_velocity` | float | Rate of disputes per month of tenure |
| `mcc_encoded` | int | Integer-encoded merchant category code |
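
For illustration, `dispute_velocity` and the related `is_repeat_disputer` flag (from Customer Features) can be sketched as below. The exact formula used by the generator is not given in this card; dividing prior disputes by tenure, with tenure clamped to at least one month, is an assumption.

```python
def is_repeat_disputer(customer_prior_disputes):
    """1 if the customer has filed 2 or more previous disputes."""
    return int(customer_prior_disputes >= 2)

def dispute_velocity(customer_prior_disputes, customer_tenure_months):
    """Disputes per month of tenure.

    Assumption: tenure is clamped to one month to avoid division by zero.
    """
    return customer_prior_disputes / max(customer_tenure_months, 1)

print(is_repeat_disputer(3), dispute_velocity(3, 12))
```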
#### Target & Outcome (3 columns)
| Column | Type | Description |
|---|---|---|
| `win_probability_true` | float | Ground-truth win probability used to generate the binary outcome; exclude it from model inputs, since it encodes the label |
| `case_disposition` | string | `REPRESENTED` (merchant fought) or `ACCEPTED` (merchant accepted loss) |
| `representment_won` | binary | **Target variable** — 1 if merchant won the representment, 0 if lost, NaN if not represented |
## Reason Codes
The dataset covers 15 real Visa and Mastercard reason codes across 4 categories:
| Category | Code | Description | Base Win Rate | Frequency |
|---|---|---|---|---|
| **FRAUD** | Visa_10.4 | Fraud — Card Not Present | 32% | 25.4% |
| **FRAUD** | Visa_10.5 | Fraud — Counterfeit Transaction | 18% | 3.1% |
| **FRAUD** | MC_4837 | No Cardholder Authorization | 28% | 8.3% |
| **FRAUD** | MC_4863 | Cardholder Does Not Recognize | 38% | 3.9% |
| **CONSUMER_DISPUTE** | Visa_13.1 | Merchandise/Services Not Received | 55% | 14.6% |
| **CONSUMER_DISPUTE** | Visa_13.3 | Not as Described / Defective | 42% | 7.9% |
| **CONSUMER_DISPUTE** | MC_4853 | Goods/Services Not as Described | 40% | 7.4% |
| **CONSUMER_DISPUTE** | MC_4855 | Goods/Services Not Received | 56% | 4.9% |
| **AUTHORIZATION** | Visa_11.1 | Card Recovery Bulletin | 22% | 2.8% |
| **AUTHORIZATION** | Visa_11.2 | Declined Authorization | 52% | 4.9% |
| **AUTHORIZATION** | MC_4808 | Authorization Related | 48% | 7.1% |
| **PROCESSING_ERROR** | Visa_12.1 | Late Presentment | 28% | 2.9% |
| **PROCESSING_ERROR** | Visa_12.2 | Incorrect Transaction Code | 62% | 2.0% |
| **PROCESSING_ERROR** | Visa_12.5 | Incorrect Amount | 68% | 2.0% |
| **PROCESSING_ERROR** | MC_4834 | Duplicate Processing | 72% | 2.9% |
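
As a worked example on the table above, the snippet below computes the frequency-weighted average of the base win rates. The values are copied from the table; the variable names are illustrative. The result comes out near 41.6%, which is lower than the ~69.7% overall win rate among represented cases. That gap is consistent with base rates being only one input to the outcome, alongside evidence, customer history, amount, and timing.

```python
# code: (base win rate %, frequency %), copied from the reason code table
REASON_CODES = {
    "Visa_10.4": (32, 25.4), "Visa_10.5": (18, 3.1),
    "MC_4837": (28, 8.3), "MC_4863": (38, 3.9),
    "Visa_13.1": (55, 14.6), "Visa_13.3": (42, 7.9),
    "MC_4853": (40, 7.4), "MC_4855": (56, 4.9),
    "Visa_11.1": (22, 2.8), "Visa_11.2": (52, 4.9),
    "MC_4808": (48, 7.1), "Visa_12.1": (28, 2.9),
    "Visa_12.2": (62, 2.0), "Visa_12.5": (68, 2.0),
    "MC_4834": (72, 2.9),
}

# Frequencies are rounded in the table, so normalize by their actual sum.
total_freq = sum(freq for _, freq in REASON_CODES.values())
weighted_base_win_rate = (
    sum(rate * freq for rate, freq in REASON_CODES.values()) / total_freq
)
print(round(total_freq, 1), round(weighted_base_win_rate, 1))
```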
## Dataset Creation
### Generation Process
The dataset was generated synthetically using a Python script (`generate_dataset.py`) with the following methodology:
1. **Customer Profiles**: 3,000 unique customers with realistic tenure, order history, dispute history, and risk profiles drawn from exponential and beta distributions.
2. **Merchant Pool**: 20 merchants across 14 MCC categories with varying volumes, chargeback rates, and ticket sizes.
3. **Transaction Generation**: Each case is assigned a reason code (weighted by real-world frequency), merchant, customer, and transaction details.
4. **Evidence Simulation**: Evidence flags are generated with base rates that vary by reason category, transaction amount, digital vs. physical goods, and customer tenure — reflecting real-world correlations.
5. **Outcome Generation**: Win probability is computed from a rule-based function incorporating reason code base rates, evidence presence, customer history, amount, and timing. Binary outcomes are sampled from this probability with added noise.
6. **Time-Based Splitting**: Represented cases are sorted by transaction date and split 60/20/20 into train/val/test.
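
Step 6 can be sketched as follows. This is not the code from `generate_dataset.py`; the function name and the integer-percentage cut points are assumptions. With 7,804 represented cases, these cut points reproduce the 4,682 / 1,561 / 1,561 row counts listed under Data Splits.

```python
def time_based_split(cases, train_pct=60, val_pct=20):
    """Sort cases chronologically and cut them into train/val/test.

    `transaction_date` must be sortable (e.g. ISO 'YYYY-MM-DD' strings).
    Integer arithmetic avoids floating-point surprises at the boundaries.
    """
    ordered = sorted(cases, key=lambda c: c["transaction_date"])
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

# Ten made-up cases, deliberately out of chronological order.
demo = [{"case_id": i, "transaction_date": f"2024-01-{day:02d}"}
        for i, day in enumerate([9, 3, 7, 1, 10, 5, 2, 8, 6, 4])]
train, val, test = time_based_split(demo)
print(len(train), len(val), len(test))
```

Because the cut is chronological, every training case predates every validation case, which in turn predates every test case (ties on the boundary date aside).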
### Realistic Correlations Built In
- **3D Secure authentication** provides a large boost for fraud cases (liability shift)
- **Delivery confirmation + signature** are critical for consumer dispute cases
- **Higher transaction amounts** reduce win probability (harder to defend)
- **Cross-border transactions** carry a penalty
- **Repeat disputers** (serial filers) are easier to defend against
- **Digital goods** require usage proof instead of delivery evidence
- **Missing the response deadline** results in an automatic loss
### Random Seeds
All random number generators use `seed=42` for full reproducibility.
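
To illustrate what a fixed seed buys, the sketch below samples binary outcomes from a list of win probabilities with a seeded generator and gets the same draw on every run. It is a simplified stand-in, not the generator's actual code: the helper `sample_outcomes` is hypothetical, it uses Python's `random` module (the original script may use a different library), and it omits the added noise mentioned in the generation process.

```python
import random

def sample_outcomes(win_probabilities, seed=42):
    """Draw one Bernoulli outcome per probability, reproducibly."""
    rng = random.Random(seed)  # private generator; global state untouched
    # rng.random() is uniform on [0, 1), so P(draw < p) equals p
    return [int(rng.random() < p) for p in win_probabilities]

probs = [0.1, 0.5, 0.9, 0.7, 0.3]
first = sample_outcomes(probs)
second = sample_outcomes(probs)
print(first == second)
```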
## Intended Use
### Primary Use Cases
- **ML model training**: Train classifiers to predict chargeback representment outcomes
- **Explainability research**: Study feature importance and SHAP explanations in financial decision-making
- **AI assistant development**: Build AI-assisted analyst tools for chargeback review
- **Educational**: Learn about chargeback processes, evidence requirements, and ML applied to fintech
### Out-of-Scope Uses
- **Production fraud detection**: This is synthetic data and should not be used as a substitute for real transaction data in production fraud systems.
- **Regulatory compliance**: The dataset does not reflect actual card network rules or regulatory requirements.
## Baseline Model Performance
An XGBoost classifier trained on this dataset achieves:
| Metric | Value |
|---|---|
| AUC-ROC | 0.719 |
| Accuracy | 70.6% |
| Precision | 73.2% |
| Recall | 90.9% |
| F1 Score | 0.811 |
| Brier Score | 0.188 |
The model uses 45 features (numeric, binary, and encoded categorical), and its predicted probabilities are calibrated with Platt scaling (sigmoid calibration).
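
For readers less familiar with the metrics in the table, here is a dependency-free sketch of two of them. The reported F1 score can be checked against the reported precision and recall, and the Brier score is the mean squared error between predicted probabilities and binary outcomes. The function names and the toy inputs are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Consistency check using the reported precision (73.2%) and recall (90.9%).
reported_f1 = round(f1_score(0.732, 0.909), 3)

# Toy example: two cases with predicted win probabilities 0.8 and 0.3.
toy_brier = brier_score([1, 0], [0.8, 0.3])
print(reported_f1, round(toy_brier, 3))
```

For context, a constant prediction equal to the win rate p has a Brier score of p(1 - p), which is about 0.211 at p = 0.697. The test split's win rate may differ from the overall figure, so treat this only as a rough reference point for the reported 0.188.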
## Limitations and Biases
- **Synthetic data**: All cases are generated programmatically. While correlations are designed to be realistic, they may not capture the full complexity of real-world chargeback patterns.
- **Simplified evidence model**: Real evidence evaluation involves document quality, completeness of narratives, and network-specific rules that are not captured here.
- **Fixed merchant pool**: Only 20 merchants are simulated, which limits merchant-level diversity.
- **No temporal drift**: The data generation process is stationary — real-world chargeback patterns shift over time due to policy changes, fraud trends, and seasonal effects.
- **US-centric**: Amount distributions and merchant categories are primarily modeled on US market patterns.
## Citation
If you use this dataset in your research or projects, please cite:
```bibtex
@misc{synthetic_chargeback_cases_2025,
title={Synthetic Chargeback Cases for Representment Win Prediction},
author={azizstark},
year={2025},
url={https://huggingface.co/datasets/azizstark/synthetic-chargeback-cases},
note={Synthetic dataset of 10,000 credit card chargeback cases for representment win prediction}
}
```
## License
This dataset is released under the [MIT License](https://opensource.org/licenses/MIT).
## Dataset Card Contact
For questions or feedback, please open an issue on the dataset's discussion tab.
Provider:
azizstark



