Nachammai41/fraud-detection-underserved-20k

Name: Nachammai41/fraud-detection-underserved-20k
Creator: Nachammai41
Published: 2026-04-11 00:48:28
License: CC BY 4.0 (per the dataset card)

Source: Hugging Face. Updated 2026-04-11; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/Nachammai41/fraud-detection-underserved-20k

Resource description:

---
license: cc-by-4.0
task_categories:
- text-classification
- tabular-classification
language:
- en
- es
- hi
- ht
- yo
- vi
- ta
modalities:
- tabular
- text
tags:
- fraud-detection
- finance
- synthetic
- multilingual
- tabular
- remittance
- underserved-communities
- tab-ddpm
- reasoning-traces
size_categories:
- 10K<n<100K
---

# Underserved Financial Fraud Detection Dataset

[![Banner: four community archetypes, 20k records, 8 languages](https://huggingface.co/datasets/Nachammai41/fraud-detection-underserved-20k/resolve/main/banner.png)](https://huggingface.co/datasets/Nachammai41/fraud-detection-underserved-20k)

**Created with Adaptive Data by Adaption** | CC BY 4.0 | 20,000 records | 8 languages | 390 reasoning traces

The first open-source fraud dataset built specifically around populations that PaySim, Sparkov, and IEEE-CIS have never modeled: immigrant remittance senders, gig workers, unbanked cash users, and ITIN-based entrepreneurs. These communities are disproportionately targeted by fraud. This dataset fills that gap.

> This dataset was published as a submission to the [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge) powered by [Adaptive Data](https://www.adaptionlabs.ai/blog/adaption-launches-adaptive-data-beta). It extends the [Fraud Detection Framework](https://github.com/nachammai779/Fraud-Detection-Framework---An-Agentic-RAG-Pipeline-with-Custom-Financial-SLM/tree/unchartered_data_challenge), an Agentic RAG pipeline with a custom Financial SLM (AUC-ROC 0.9486 on IEEE-CIS).

---

## Dataset Summary

| Statistic | Value |
|---|---|
| Total records | 20,000 |
| Records per archetype | 5,000 |
| Fraud rate | ~10% (vs. 3.5% in IEEE-CIS) |
| Languages | 8 |
| AI-generated narratives | 15,076 |
| Reasoning traces | 390 |
| Seed narratives (scraped) | 1,040 |
| Schema fields | 31 (8 universal + 23 archetype extensions) |
| Generation method | Tab-DDPM (denoising diffusion for tabular data) |
| Narrative quality (Adaptive Data) | E (5.0) → A (9.2–9.4) |

---

## The Four Archetypes

| Archetype | Who | Payment Instruments | Fraud Vectors | Languages |
|---|---|---|---|---|
| **Remittance Sender** | Immigrants sending money cross-border | Western Union, Remitly, MoneyGram, Xoom, Hawala | Emergency call scams, exchange rate manipulation, interception, fake family | `es` `ht` `yo` `hi` `en` |
| **Gig Worker** | Uber, DoorDash, Instacart workers | CashApp, Venmo, platform payout portals | Account takeover, SIM swap, fake platform support calls | `en` `hi` `vi` `es` `yo` |
| **Unbanked Cash User** | Populations using prepaid cards and retail kiosks | Prepaid Visa/Mastercard, retail kiosks, Green Dot | Predatory micro-loans, load-fee scams, fake utility kiosks | `en` `es` `vi` `yo` `hi` |
| **ITIN Entrepreneur** | Immigrant small business owners (no SSN) | Business checking, ACH, informal credit | Synthetic identity fraud, fake tax returns, mule accounts | `en` `es` `hi` `ta` `vi` |

### Archetype Examples

#### Remittance Sender: Confirmed Fraud

> *"I logged into my Xoom account, heart heavy with the responsibility of sending my life savings of $10,000 back home to fix my sister's roof, trusting the screen when it promised a specific exchange rate that seemed fair enough... But by the time the transfer cleared, a fraudulent operator had quietly applied a different rate, pocketing the difference."*
>
> `is_fraud: 1` | `fraud_vector: exchange_rate` | `instrument: Xoom` | `language: en`

#### Remittance Sender: Benign (No Fraud)

> *"I stood in line for twenty minutes, clutching the transfer reference number for the $419.89 my daughter sent from abroad, believing I was finally securing the rent money we desperately needed."*
>
> `is_fraud: 0` | `fraud_vector: exchange_rate` | `instrument: Cash pickup` | `language: en`

#### Multilingual Example: Haitian Creole

> *"Nan laj swasanncinq ane, nan vye zòtèy mwen, mwen te kwè yon mesaj ki pwomèt yon bonis espesyal sou Xoom pou voye lajan bay fanmi m nan Pòtoprens..."*
>
> `is_fraud: 0` | `fraud_vector: bonus` | `instrument: Xoom` | `language: ht`

#### Multilingual Example: Yoruba

> *"Mo jẹ́ ọmọ ọdún aárùn-únlélógóta, ìbànújẹ́ sì ń kun ọkàn mi nítorí owó ẹgbẹ̀wàá dọ́là ($10,000)..."*
>
> `is_fraud: 1` | `fraud_vector: exchange_rate` | `instrument: Wire transfer` | `language: yo`

---

## Languages

| Code | Language | Notes |
|---|---|---|
| `en` | English | All archetypes |
| `es` | Spanish | Remittance, ITIN, Unbanked |
| `hi` | Hinglish | Gig worker, ITIN |
| `ht` | Haitian Creole | Remittance |
| `yo` | Yoruba | Remittance, Gig worker |
| `vi` | Vietnamese | Remittance, Unbanked, ITIN |
| `ta` | Tamil | ITIN |
| `ta-en` | Tamil-English code-switch | ITIN |

Fraud terminology is preserved in-language (e.g. *estafa*, *fraude*, *ìbànújẹ́*), not translated into English. This is intentional: multilingual fraud classifiers must learn scam signals in the victim's actual language.

---

## What Makes It Different

**No existing fraud dataset covers this population.** PaySim simulates generic mobile money. Sparkov models middle-class credit cards. IEEE-CIS captures e-commerce. None model remittance kiosks, gig payouts, or ITIN-linked accounts.

| Dataset | Population | Fraud Rate | Narratives | Reasoning Traces |
|---|---|---|---|---|
| PaySim | Generic mobile money | 0.1% | None | None |
| Sparkov | Middle-class credit cards | ~0.6% | None | None |
| IEEE-CIS | E-commerce (Vesta) | 3.5% | None | None |
| **This dataset** | **Underserved communities** | **~10%** | **15,076 (8 languages)** | **390** |

The elevated fraud rate (~10%) reflects the disproportionate targeting of these communities, not a sampling artifact.

**Generated with diffusion, not rules.** Tabular data was generated using Tab-DDPM (denoising diffusion for tabular data), which learns joint correlations across behavioral features rather than sampling each column independently. The model was trained on an A100 GPU via Google Colab Pro.

**Multilingual narrative text.** Every fraud transaction has a `narrative_text` field: the scam account in the community's language, written in first person. Generated by Adaptive Data by Adaption. Quality score improved from E (5.0) to A (9.2–9.4).

**Reasoning traces.** 390 chain-of-thought fraud analysis examples with step-by-step investigator reasoning grounded in community-specific fraud signals. No existing open fraud dataset includes this. Built for fine-tuning financial language models (FinBERT, Gemma, Llama).

---

## Data Sources

### Seed Narratives

| Source | Records | Content |
|---|---|---|
| CFPB Consumer Complaint Database | ~400 | Real consumer fraud reports, remittance and wire fraud |
| BBB Scam Tracker | ~300 | Community-reported scam narratives |
| Reddit archive (Pullpush.io) | ~340 | First-person fraud accounts from immigrant community subreddits |

### Behavioral Distributions

Archetype distributions (transaction amounts, channels, corridors, fraud vector probabilities) are derived from scraped narratives and World Bank remittance corridor data, not from empirically measured transaction logs.

---

## Data Processing Pipeline

```
1. Scrape    1,040 real fraud narratives from CFPB, BBB Scam Tracker,
             and Reddit archive (Pullpush.io)
2. Profile   Behavioral distributions per archetype derived from scraped
             narratives: amounts, channels, corridors, fraud vectors,
             language mix
3. Generate  Tab-DDPM trains on 5,000 seed rows per archetype, learns joint
             feature correlations, generates 5,000 synthetic transactions
             per archetype (A100 GPU, Colab Pro)
4. Narrate   Adaptive Data by Adaption fills narrative_text in 8 languages
             per transaction's fraud context. Quality: E (5.0) → A (9.2–9.4)
5. Trace     390 chain-of-thought reasoning traces generated:
             investigator-style fraud analysis for fine-tuning use
```

### Fraud Label Assignment

| `is_fraud` | Meaning | Rate |
|---|---|---|
| `1` | Confirmed fraud transaction | ~10% |
| `0` | Legitimate or ambiguous transaction | ~90% |

The fraud rate deliberately exceeds IEEE-CIS (3.5%) to reflect the higher victimization rate of these communities. Fraud labels are assigned at the Tab-DDPM training stage, not post-hoc.

---

## Schema

### Loading the Dataset

```python
from datasets import load_dataset

ds = load_dataset("Nachammai41/fraud-detection-underserved-20k")
train = ds["train"]

# Example: filter to confirmed fraud in Yoruba
fraud_yo = train.filter(lambda x: x["is_fraud"] == 1 and x["language"] == "yo")

# Example: filter by archetype
remittance = train.filter(lambda x: x["archetype"] == "remittance")
```

### Universal Fields (all archetypes)

| Field | Type | Description |
|---|---|---|
| `data_uuid` | string | Unique record identifier (UUID v4) |
| `id` | string | Synthetic record ID (`synth_...`) |
| `archetype` | string | `remittance` / `gig_worker` / `unbanked` / `itin` |
| `source` | string | Generation method: `tabddpm_synthetic` |
| `narrative_text` | string | First-person scam narrative in community language |
| `detected_language_hints` | list[str] | ISO 639-1 codes detected in narrative |
| `fraud_vector_hint` | string | High-level fraud signal (keyword/mechanism) |
| `fraud_vector` | string | Normalized fraud vector category |
| `language` | string | Primary language of `narrative_text` |
| `is_fraud` | int (0/1) | Ground truth fraud label |
| `record_timestamp` | ISO 8601 | Synthetic transaction timestamp |

### Transaction Fields

| Field | Type | Description |
|---|---|---|
| `transaction_amount_usd` | float | Transaction amount in USD |
| `fee_amount_usd` | float | Fee charged on the transaction |
| `instrument` | string | Payment instrument (Xoom, Wire transfer, MoneyGram, Hawala, Cash pickup…) |
| `hour_of_day` | int 0–23 | Hour of transaction (UTC); a timing behavioral signal |
| `day_of_week` | int 0–6 | Day of week (0 = Monday) |
| `day_of_week_name` | string | Human-readable day name |

### Sender Behavioral Features

| Field | Type | Description |
|---|---|---|
| `sender_age` | int | Sender age in years |
| `days_since_last_txn` | int | Days since the account's previous transaction |
| `account_age_days` | int | Days since account was opened |
| `txn_count_30d` | int | Number of transactions in the past 30 days |

### Notes on Edge Cases

1. **Fraud label vs. fraud vector**: `is_fraud=0` rows still have a `fraud_vector`; these are transactions that *looked like* a scam pattern but were ultimately benign (legitimate emergency remittances, genuine exchange rate queries, etc.). This mirrors real-world investigator experience where most flagged transactions are legitimate.
2. **Narrative text artifacts**: A small number of rows contain model refusal artifacts (e.g. rows beginning *"I cannot generate content from the perspective of a scammer..."*). These are retained as-is; they appear in the `narrative_text` column and may be filtered with `narrative_text.startswith("I cannot")`.
3. **Reasoning traces**: Only ~390 of 20,000 rows have a reasoning trace. Filter with `ds.filter(lambda x: x.get("reasoning_trace") is not None)` if using for fine-tuning.
4. **ITIN archetype data**: Refer to independent companion datasets `itin_fraud_narratives` and `itin_fraud_narratives_reasoning` for archetype-level quality evaluation scores and individual split details.
5. **Synthetic timestamps**: All `record_timestamp` values fall in the 2024 calendar year. Date distributions within the year reflect behavioral patterns from scraped narratives (e.g. higher fraud volume on weekends, late-night transaction spikes).

---

## Intended Use

- **Training fraud detection models** on underserved community transaction patterns and multilingual narratives
- **Benchmarking existing models** (e.g. trained on IEEE-CIS) against this population. The core research question driving this dataset: *how does a model trained on mainstream data perform on populations it has never seen?*
- **Fine-tuning financial language models** (FinBERT, Gemma, Llama) on multilingual fraud narratives and chain-of-thought reasoning traces
- **AI fairness and financial inclusion research**: measuring disparate model performance across demographic proxies
- **NLP research** on under-resourced financial language varieties (Haitian Creole, Yoruba, Tamil)

---

## What This Is Not

This is a **fully synthetic** dataset. No real transaction data. No PII. Behavioral distributions are informed by public fraud narratives and World Bank remittance corridor data, not empirically measured transaction logs. Like all synthetic fraud datasets (PaySim, Sparkov, Cifer-AF), ground truth validation against real data is not possible due to privacy constraints.
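
Edge-case notes 2 and 3 in the Schema section can be applied as a small pre-processing step. The sketch below is only an illustration under stated assumptions: rows are plain dicts with the `narrative_text` and `reasoning_trace` keys named in those notes, the helper names (`is_refusal`, `clean_rows`, `rows_with_traces`) are hypothetical, and the sample rows are made up rather than taken from the dataset.

```python
from typing import Iterable


def is_refusal(row: dict) -> bool:
    """True when narrative_text starts with the refusal prefix from note 2."""
    text = row.get("narrative_text") or ""
    return text.startswith("I cannot")


def clean_rows(rows: Iterable[dict]) -> list[dict]:
    """Drop rows whose narrative is a model refusal artifact."""
    return [row for row in rows if not is_refusal(row)]


def rows_with_traces(rows: Iterable[dict]) -> list[dict]:
    """Keep only rows that carry a reasoning trace (note 3)."""
    return [row for row in rows if row.get("reasoning_trace") is not None]


# Hypothetical in-memory sample; real rows would come from the loaded dataset.
sample = [
    {"narrative_text": "I cannot generate content from the perspective of a scammer.",
     "reasoning_trace": None},
    {"narrative_text": "I logged into my Xoom account...",
     "reasoning_trace": "Step 1: ..."},
    {"narrative_text": "I stood in line for twenty minutes...",
     "reasoning_trace": None},
]

cleaned = clean_rows(sample)        # refusal artifact removed
traced = rows_with_traces(cleaned)  # only rows usable for trace fine-tuning
```

Because both helpers are built on per-row checks, the same logic can in principle be passed to a `filter` call on the loaded dataset instead of a Python list.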

---

## Companion Datasets

Refer to the following independent datasets for per-archetype quality evaluation and reasoning trace subsets:

- `remittance_fraud_narratives`
- `gig_worker_fraud_narratives`
- `unbanked_fraud_narratives`
- `itin_fraud_narratives`
- `remittance_fraud_narratives_reasoning`
- `gig_worker_fraud_narratives_reasoning`
- `unbanked_fraud_narratives_reasoning`
- `itin_fraud_narratives_reasoning`

Final quality across all individual datasets: **A**, with relative quality improvement of **84–88%**.

---

## Citation

```bibtex
@dataset{palaniappan2026underserved,
  author    = {Palaniappan, Nachammai},
  title     = {Underserved Financial Fraud Detection Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  note      = {Created with Adaptive Data by Adaption. Uncharted Data Challenge, Adaption Labs, April 2026. Extends the Fraud Detection Framework (AUC-ROC 0.9486 on IEEE-CIS).},
  url       = {https://huggingface.co/datasets/Nachammai41/fraud-detection-underserved-20k}
}
```

---

## Credits

- **Adaptive Data by Adaption**: narrative generation and dataset quality enrichment
- **Tab-DDPM** (Kotelnikov et al., 2022): tabular denoising diffusion model
- **CFPB**: Consumer Financial Protection Bureau public complaint database
- **BBB Scam Tracker**: Better Business Bureau public scam reports
- **Pullpush.io**: Reddit archive API
- **World Bank**: remittance corridor behavioral data

---

*License: CC BY 4.0. Free to use with attribution.*

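Provider:

Nachammai41

Summary

Dataset Introduction

Construction

In financial fraud detection, existing datasets tend to focus on mainstream transaction patterns and overlook the specific risks faced by marginalized groups. This dataset uses synthetic data generation seeded with 1,040 real fraud narratives from the CFPB Consumer Complaint Database, BBB Scam Tracker, and Reddit communities. Tab-DDPM (a tabular denoising diffusion probabilistic model) learns the joint distribution of behavioral features for each community archetype. The model was trained on an A100 GPU and generated 5,000 synthetic transaction records for each of the four community archetypes. Adaptive Data then filled in first-person narrative text in eight languages, yielding a multilingual dataset of 20,000 records with a fraud rate of about 10%.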
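
The headline figures above (four archetypes of 5,000 records each, roughly 10% fraud) are easy to sanity-check once rows are in memory. The following is a minimal sketch, assuming rows are dicts with the `archetype` and `is_fraud` fields from the schema; the function name `fraud_rate_by_archetype` is hypothetical, and the toy rows carry made-up labels purely to exercise it.

```python
from collections import Counter


def fraud_rate_by_archetype(rows):
    """Return {archetype: fraction of rows with is_fraud == 1}."""
    totals = Counter()
    frauds = Counter()
    for row in rows:
        totals[row["archetype"]] += 1
        frauds[row["archetype"]] += int(row["is_fraud"] == 1)
    return {name: frauds[name] / totals[name] for name in totals}


# Toy rows with made-up labels, not real dataset records.
toy = (
    [{"archetype": "remittance", "is_fraud": 1}] * 1
    + [{"archetype": "remittance", "is_fraud": 0}] * 9
    + [{"archetype": "itin", "is_fraud": 1}] * 2
    + [{"archetype": "itin", "is_fraud": 0}] * 8
)
rates = fraud_rate_by_archetype(toy)

# Record count stated in the description: 4 archetypes x 5,000 each.
expected_total = 4 * 5_000
```

On the real data, per-archetype rates that deviate sharply from the stated ~10% would be worth investigating before training.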

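Features

The dataset's defining feature is its focus on underserved financial communities: cross-border remittance senders, gig-economy workers, unbanked cash users, and ITIN entrepreneurs, groups long absent from conventional datasets. It covers eight languages, and the narrative text preserves fraud terminology in the original language, such as Spanish "estafa" and Yoruba "ìbànújẹ́", to keep the language authentic. It also includes 390 chain-of-thought reasoning traces that give models step-by-step fraud analysis examples, and it defines 31 schema fields (8 universal fields and 23 archetype-specific extension fields) to capture multidimensional behavioral signals.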

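Usage

The dataset is suited to training fraud detection models on the transaction patterns of marginalized communities, and it can serve as a benchmark for evaluating how models trained on mainstream data perform on new populations. Researchers can use its multilingual narratives and reasoning traces to fine-tune financial-domain language models such as FinBERT or Gemma and improve cross-lingual fraud recognition. In ethical computing, it supports algorithmic fairness research by helping analyze performance differences across demographic proxy variables. It can be loaded with the Hugging Face datasets library, and subsets can be extracted by fraud label, language, or community archetype using filter functions.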
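
Subsetting by fraud label, language, or archetype can be wrapped in one reusable predicate. This is a sketch rather than part of the dataset's tooling: `make_predicate` is a hypothetical helper, the field names follow the schema in the dataset card, and the rows are invented examples. The predicate takes a single row dict, so it should also be usable as the function argument to a `filter` call on the loaded split, in the same way as the lambdas in the card's loading example.

```python
def make_predicate(*, is_fraud=None, language=None, archetype=None):
    """Build a row predicate that checks every criterion that is not None."""
    wanted = {"is_fraud": is_fraud, "language": language, "archetype": archetype}
    active = {key: value for key, value in wanted.items() if value is not None}

    def predicate(row):
        return all(row.get(key) == value for key, value in active.items())

    return predicate


# Invented rows for illustration only.
rows = [
    {"is_fraud": 1, "language": "yo", "archetype": "remittance"},
    {"is_fraud": 0, "language": "yo", "archetype": "gig_worker"},
    {"is_fraud": 1, "language": "es", "archetype": "remittance"},
    {"is_fraud": 0, "language": "en", "archetype": "itin"},
]

fraud_yo = list(filter(make_predicate(is_fraud=1, language="yo"), rows))
remittance = list(filter(make_predicate(archetype="remittance"), rows))
everything = list(filter(make_predicate(), rows))  # no criteria: keep all rows
```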

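Background and Challenges

Background Overview

In fintech and fraud detection, existing datasets such as PaySim, Sparkov, and IEEE-CIS mainly cover mainstream electronic payment, credit card, and e-commerce scenarios. They do not adequately cover underserved groups such as immigrant remittance senders, gig-economy workers, unbanked cash users, and Individual Taxpayer Identification Number (ITIN) entrepreneurs. To address this gap, researcher Nachammai Palaniappan built, in 2026 and using Adaption Labs' Adaptive Data technology, the first open-source fraud detection dataset focused on underserved communities. Using synthetic data generation, the dataset simulates the transaction behavior and multilingual narrative text of four community archetypes. It aims to advance financial inclusion research and to test how models trained on mainstream data generalize to unseen populations, which matters for improving the fairness and coverage of fraud detection systems.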

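Current Challenges

The dataset targets fraud detection for underserved communities, whose distinctive transaction patterns, linguistic diversity, and data scarcity make their fraud hard for conventional models to identify. The construction challenges center on balancing synthesis against realism. First, when generating tabular data with the Tab-DDPM diffusion model, the joint correlations among behavioral features must match real-world distributions rather than result from independent column sampling. Second, to cover eight languages (including low-resource languages such as Haitian Creole and Yoruba), narrative generation must preserve fraud terminology in the original language while avoiding the semantic distortion that translation can cause. Third, the seed narratives come only from public complaints and community reports, with no empirical transaction logs, so the data distributions rest on inference from narrative mining and World Bank remittance corridor data, which may introduce bias.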

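Common Scenarios

Classic Use Cases

In financial fraud detection, conventional datasets tend to focus on mainstream payment scenarios, whereas the fraud-detection-underserved-20k dataset is designed for underserved communities and fills that research gap. Its classic use is training and evaluating fraud detection models, particularly for four archetypal groups: immigrant remittance senders, gig-economy workers, unbanked cash users, and entrepreneurs using an Individual Taxpayer Identification Number. Through synthetic data generation, the dataset simulates these communities' characteristic transaction patterns and multilingual narrative text, giving models rich material for learning cross-cultural fraud signals.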
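
When evaluating a model across the four archetypes, a per-group metric makes performance gaps visible. The sketch below computes fraud recall per archetype; it is one possible evaluation, not a procedure prescribed by the dataset. The function name `recall_by_archetype` is hypothetical, and both the rows and the predictions are invented.

```python
from collections import defaultdict


def recall_by_archetype(rows, predictions):
    """Fraction of true fraud rows that were flagged as fraud, per archetype."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for row, pred in zip(rows, predictions):
        if row["is_fraud"] == 1:
            positives[row["archetype"]] += 1
            hits[row["archetype"]] += int(pred == 1)
    # Archetypes with no fraud rows are omitted to avoid division by zero.
    return {name: hits[name] / positives[name] for name in positives}


# Invented labels and predictions for illustration only.
eval_rows = [
    {"archetype": "gig_worker", "is_fraud": 1},
    {"archetype": "gig_worker", "is_fraud": 1},
    {"archetype": "unbanked", "is_fraud": 1},
    {"archetype": "unbanked", "is_fraud": 1},
    {"archetype": "unbanked", "is_fraud": 0},
]
eval_preds = [1, 1, 1, 0, 1]

recall = recall_by_archetype(eval_rows, eval_preds)
```

A large spread between archetypes in such a table is the kind of disparity the benchmarking and fairness use cases are meant to surface.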

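Derived and Related Work

A number of research efforts have grown out of this dataset, in two main directions. First, researchers have fine-tuned pretrained financial language models on its multilingual narratives and reasoning traces, developing specialized classifiers that understand fraud terminology in Haitian Creole or Tamil. Second, the dataset is often used as a benchmark to evaluate how models trained on mainstream datasets such as IEEE-CIS transfer to underserved populations. This work has revealed specific manifestations of model bias and prompted cross-domain methodological discussion on evaluating the quality of synthetic data generation.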

Recent Research on the Dataset