Nachammai41/unbanked-fraud-narratives

Name: Nachammai41/unbanked-fraud-narratives
Creator: Nachammai41
Published: 2026-04-11 00:00:37
License: No description available

Hugging Face | Updated 2026-04-11 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Nachammai41/unbanked-fraud-narratives


Resource description:

---
annotations_creators: []
language:
- en
language_creators: []
license: []
multilinguality:
- monolingual
pretty_name: 'unbanked_fraud_narratives'
size_categories:
- 1K<n<10K
source_datasets:
- 'original'
tags:
- adaption
- instruction-tuning
- writing-editing-communication
- personal-finance
task_categories: []
task_ids: []
---

![banner](https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/35ce37a6-5520-4fde-a518-69a88de607e6.png)

This dataset is a remastered version prepared using [Adaption's](https://adaptionlabs.ai/app/auth) Adaptive Data platform.

# unbanked_fraud_narratives

This dataset contains prompts designed to generate first-person narratives from unbanked individuals involved in legitimate or fraudulent financial transactions. Each entry specifies details such as the fraud vector, financial instrument, transaction amount, and community context like payday loans or prepaid cards. The completions are currently empty, indicating this is a prompt-only collection for generating synthetic data on financial exploitation.

### Dataset size

There are 3,766 data points in this dataset. This is an instruction tuning dataset.

### Quality of Remastered Dataset

The final quality is A, with a relative quality improvement of 86.0%.

### Domain

- Writing-editing-communication (90%)
- Personal-finance (10%)

### Language

- English (100%)

### Tone

- Anecdotal (100%)

### Evaluation Results

- **Quality Gains:**
  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/602bbff5-edd1-4c3f-9c84-3849130605e2.png" alt="QualityGains" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Grade Improvement:**
  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/13f8591b-bcab-4a5b-987c-a7cf5a91b5b0.png" alt="Grade" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Percentile Chart:**
  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/649670a9-40e3-482a-a71b-92268d395ce0.png" alt="Percentile Chart" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />

# Underserved Financial Fraud Dataset

### Synthetic fraud detection data for underrepresented communities

**Created with Adaptive Data by Adaption** | CC BY 4.0 | 5 languages

---

## What This Is

A synthetic financial fraud dataset covering **four underserved community archetypes**: populations that rely on remittance transfers, gig economy payouts, prepaid cards, and ITIN-based transactions. These communities are disproportionately targeted by fraud, yet no open-source fraud dataset has ever modeled their financial behavior. This dataset fills that gap.

---

## The Four Archetypes

| Archetype | Who | Fraud Vectors | Languages |
|---|---|---|---|
| **Remittance Sender** | Immigrants sending money cross-border via Western Union, Remitly, MoneyGram | Emergency call scams, fake exchange rate bonuses, interception | es, ht, yo, hi, en |
| **Gig Worker** | Uber, DoorDash, Instacart workers paid via CashApp, Venmo | Account takeover, SIM swap, fake platform support calls | en, hi, vi, es, yo |
| **Unbanked Cash-In User** | Populations using prepaid cards and retail kiosks | Predatory micro-loans, load-fee scams, fake utility kiosks | en, es, vi, yo, hi |
| **ITIN Entrepreneur** | Immigrant small business owners with no SSN | Synthetic identity fraud, fake tax returns, mule accounts | en, es, hi, ta, vi |

## Languages

`en` English | `es` Spanish | `hi` Hinglish | `ht` Haitian Creole | `yo` Yoruba | `vi` Vietnamese | `ta` Tamil | `ta-en` Tamil-English

---

## What Makes It Different

**No existing fraud dataset covers this population.** PaySim simulates generic mobile money. Sparkov models middle-class credit cards. IEEE-CIS captures e-commerce. None covers remittance kiosks, gig payouts, or ITIN-linked accounts.

**Generated with diffusion, not rules.** Tabular data was generated using Tab-DDPM (denoising diffusion for tabular data), which learns joint correlations across behavioral features rather than sampling each column independently. Trained on an A100 GPU via Google Colab Pro.

**Multilingual narrative text.** Every fraud transaction has a `narrative_text` field: the scam message or fraud description in the community's language. Generated by Adaptive Data by Adaption. Quality score improved from E (5.0) to A (9.2–9.4).

**Reasoning traces.** 390 chain-of-thought fraud analysis examples: step-by-step investigator reasoning grounded in community-specific fraud signals. No existing fraud dataset includes this. Built for fine-tuning financial language models (FinBERT, Gemma).

---

## How It Was Built

```
1. Scrape     1,040 real fraud narratives from CFPB, BBB Scam Tracker,
              and Reddit archive (Pullpush.io)

2. Profile    Behavioral distributions per archetype derived from scraped
              narratives: amounts, channels, corridors, fraud vectors, language mix

3. Generate   Tab-DDPM trains on 5,000 seed rows per archetype, learns joint
              feature correlations, generates 5,000 synthetic transactions per archetype

4. Narrate    Adaptive Data by Adaption fills narrative_text in 8 languages
              per transaction's fraud context

5. Trace      390 reasoning traces generated: chain-of-thought fraud analysis
              for fine-tuning use
```

---

## Schema (Key Fields)

| Field | Type | Description |
|---|---|---|
| `transaction_id` | uuid | Unique identifier |
| `archetype` | categorical | remittance / gig_worker / unbanked / itin |
| `amount_usd` | float | Transaction amount |
| `channel` | categorical | retail_kiosk / mobile_app / p2p / bank_wire |
| `fraud_vector` | categorical | Specific scam type |
| `is_fraud` | bool | Ground truth label |
| `fraud_confidence` | float | 0.0–1.0 label confidence |
| `narrative_text` | string | Scam description in community language |
| `narrative_language` | categorical | ISO 639-1 language code |
| `reasoning_trace` | string | Chain-of-thought fraud analysis (sampled rows) |

## Intended Use

- Training fraud detection models on underserved community transaction patterns
- Benchmarking existing models (IEEE-CIS trained) against this population
- Fine-tuning financial language models on multilingual fraud narratives
- Research into AI fairness and financial inclusion
- NLP research on under-resourced financial language

---

## What This Is Not

This is a **fully synthetic** dataset. No real transaction data. No PII. Behavioral distributions are informed by public fraud narratives and World Bank remittance corridor data, not empirically measured transaction logs.
Like all synthetic fraud datasets (PaySim, Sparkov, Cifer-AF), ground truth validation against real data is not possible due to privacy constraints.

---

## Origin

This dataset was created as part of the **Uncharted Data Challenge** by Adaption Labs (April 2026). It extends the [Fraud Detection Framework](https://github.com/nachammai779/Fraud-Detection-Framework---An-Agentic-RAG-Pipeline-with-Custom-Financial-SLM), an Agentic RAG pipeline with a custom Financial SLM built on the IEEE-CIS dataset (AUC-ROC 0.9486). The underserved dataset enables direct benchmarking: how does a model trained on mainstream data perform on populations it has never seen?

---

## Citation

```bibtex
@dataset{palaniappan2026underserved,
  author    = {Palaniappan, Nachammai},
  title     = {Underserved Financial Fraud Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  note      = {Created with Adaptive Data by Adaption. Uncharted Data Challenge, Adaption Labs.},
  url       = {https://huggingface.co/datasets/nachammai779/underserved-financial-fraud}
}
```

---

## Credits

- **Adaptive Data by Adaption**: Narrative generation and dataset enrichment
- **Tab-DDPM** (Kotelnikov et al., 2022): Tabular diffusion model
- **CFPB**: Consumer Financial Protection Bureau public complaint database
- **BBB Scam Tracker**: Better Business Bureau public scam reports
- **Pullpush.io**: Reddit archive API

---

*License: CC BY 4.0. Free to use with attribution*
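
As a rough illustration of the Schema (Key Fields) table in this card, the sketch below models one row as a Python dataclass and checks the documented categorical values and the 0.0–1.0 confidence range. The class name `FraudRecord`, the helper `validate`, and the sample `fraud_vector` value are hypothetical and not part of the dataset; the card does not say how rows are actually serialized.

```python
from dataclasses import dataclass
from typing import List, Optional

# Allowed values copied from the Schema (Key Fields) table.
ARCHETYPES = {"remittance", "gig_worker", "unbanked", "itin"}
CHANNELS = {"retail_kiosk", "mobile_app", "p2p", "bank_wire"}


@dataclass
class FraudRecord:
    """Hypothetical in-memory view of one row; field names follow the schema table."""
    transaction_id: str
    archetype: str
    amount_usd: float
    channel: str
    fraud_vector: str
    is_fraud: bool
    fraud_confidence: float
    narrative_text: str
    narrative_language: str
    reasoning_trace: Optional[str] = None  # only present on sampled rows


def validate(record: FraudRecord) -> List[str]:
    """Return a list of problems found in the record; an empty list means none."""
    problems = []
    if record.archetype not in ARCHETYPES:
        problems.append(f"unknown archetype: {record.archetype!r}")
    if record.channel not in CHANNELS:
        problems.append(f"unknown channel: {record.channel!r}")
    if not 0.0 <= record.fraud_confidence <= 1.0:
        problems.append(f"fraud_confidence out of range: {record.fraud_confidence}")
    return problems


# Made-up example row; "emergency_call_scam" is an illustrative label only.
example = FraudRecord(
    transaction_id="00000000-0000-0000-0000-000000000001",
    archetype="remittance",
    amount_usd=120.0,
    channel="retail_kiosk",
    fraud_vector="emergency_call_scam",
    is_fraud=True,
    fraud_confidence=0.9,
    narrative_text="(narrative in the community language)",
    narrative_language="es",
)
print(validate(example))
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole file and count which fields fail most often.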

Provider:

Nachammai41

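Collected summary

Dataset introduction

Construction method

In fintech, fraud detection data for underserved populations has long been scarce, and this dataset takes a systematic approach to filling that gap. Construction began by collecting 1,040 real fraud narratives from the Consumer Financial Protection Bureau (CFPB), the Better Business Bureau (BBB) Scam Tracker, and a Reddit archive, which were used to profile the behavioral distributions of four community archetypes. The Tab-DDPM tabular diffusion model was then trained on 5,000 seed rows per archetype to learn joint feature correlations and generate synthetic transactions. Finally, the Adaptive Data platform generated multilingual narrative text for each transaction, supplemented by 390 chain-of-thought reasoning traces, yielding a structurally complete synthetic dataset.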

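Features

The dataset's core value lies in its focus on four marginalized financial groups that have long been absent from traditional fraud datasets: remittance senders, gig economy workers, unbanked cash-in users, and ITIN entrepreneurs. Technically, it uses a diffusion model to generate tabular data with complex correlated features rather than relying on rule-based simulation. Each fraud transaction also carries narrative text written in the community's own language, covering eight languages including English, Spanish, and Yoruba, and the dataset adds step-by-step fraud analysis chains that supply scarce semantic material for fine-tuning financial language models.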

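Usage

The dataset primarily serves research on financial fairness and inclusion, providing a training benchmark for fraud detection models aimed at marginalized groups. Researchers can use its multilingual narrative text to fine-tune financial-domain language models and improve their semantic understanding of non-mainstream financial scenarios. They can also compare how models trained on mainstream datasets perform on this dataset to assess algorithmic fairness and generalization. Users should keep in mind that the data is fully synthetic: although its behavioral distributions are built from real narratives, it differs from empirical transaction data and is suited to method validation rather than direct deployment in production systems.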
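
The benchmarking idea described here (scoring a model trained on mainstream data against each underserved group) could be sketched as a per-archetype recall computation. This is a minimal sketch under assumptions: rows are plain dicts carrying the `archetype` and `is_fraud` fields from the schema, predictions are booleans from whatever model is being tested, and the function name `recall_by_archetype` is hypothetical.

```python
from collections import defaultdict


def recall_by_archetype(rows, predictions):
    """Fraction of true fraud rows flagged by the model, split by archetype.

    rows: iterable of dicts with 'archetype' and 'is_fraud' keys.
    predictions: iterable of bools, aligned with rows.
    Archetypes with no fraud rows are omitted from the result.
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for row, predicted_fraud in zip(rows, predictions):
        if row["is_fraud"]:
            total[row["archetype"]] += 1
            if predicted_fraud:
                caught[row["archetype"]] += 1
    return {name: caught[name] / total[name] for name in total}


# Toy data, invented for illustration only.
rows = [
    {"archetype": "remittance", "is_fraud": True},
    {"archetype": "remittance", "is_fraud": True},
    {"archetype": "itin", "is_fraud": True},
    {"archetype": "itin", "is_fraud": False},
]
predictions = [True, False, True, True]
print(recall_by_archetype(rows, predictions))
```

Reporting the metric per archetype rather than as one aggregate number keeps a weak result on one community from being hidden by strong results on another.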

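Background and challenges

Background overview

At the intersection of fintech and artificial intelligence, data resources on financial fraud affecting marginalized groups have long been scarce. The unbanked-fraud-narratives dataset was created in 2026 as part of Adaption Labs' Uncharted Data Challenge to fill this gap. It focuses on four representative vulnerable groups (remittance senders, gig economy workers, unbanked cash-in users, and ITIN entrepreneurs) and uses synthetic data techniques to simulate the fraud scenarios they face. The core research question is how to build data combining tabular and text modalities that reflects community-specific financial behavior and language, in support of fairer and more inclusive fraud detection models.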

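Current challenges

The dataset targets a core problem in fraud detection: the under-modeling of vulnerable groups. Traditional fraud datasets are mostly based on mainstream transaction patterns and struggle to capture the fraud vectors and behavioral correlations specific to communities that rely on non-traditional channels such as remittances and prepaid cards. During construction, the main challenge was balancing data generation against realism. On one hand, diffusion models such as Tab-DDPM must learn joint feature distributions from a limited set of public narratives to generate synthetic data without leaking private information. On the other hand, synthetic data lacks the empirical grounding of real transaction logs, so the fidelity of its behavioral distributions and its ground-truth validation are inherently limited. In addition, the narrative text generated to cover multilingual communities must be linguistically accurate in both cultural context and fraud description, which places higher demands on the generation technique.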

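Common scenarios

Classic use cases

In fintech and computational social science, the unbanked-fraud-narratives dataset is a resource for studying fraud affecting financially marginalized groups. It uses generative methods to build synthetic transaction data for four representative groups: remittance senders, gig economy workers, unbanked cash-in users, and ITIN entrepreneurs. Its core use is training and evaluating fraud detection models. Researchers can use the structured data, with its multilingual narrative text and reasoning traces, to simulate real-world fraud patterns, compensating for the limited population coverage of traditional fraud datasets and advancing financial security technology for vulnerable groups.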

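Practical applications

In practice, the dataset gives financial institutions and technology companies training material for building more inclusive anti-fraud systems. Using its synthetic multilingual fraud narratives and transaction features, organizations can fine-tune existing financial language models or fraud detection algorithms to recognize scams aimed at specific communities, such as emergency call scams, account takeover, or fake platform support. The reasoning traces can also be used to build explainable AI-assisted investigation tools that help compliance analysts follow the reasoning behind a fraud decision, improving operational effectiveness and regulatory fit in payment clearing, risk control, and consumer protection.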
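
One way the reasoning traces might be turned into fine-tuning material is sketched below: each row that has a `reasoning_trace` becomes a prompt/completion pair, written out as JSON Lines. The prompt wording, the `Verdict:` suffix, and the function names are assumptions made for illustration; the dataset card does not prescribe a fine-tuning format.

```python
import json


def to_finetune_example(row):
    """Build a prompt/completion pair from one row, or None if it has no trace."""
    trace = row.get("reasoning_trace")
    if not trace:
        return None
    # Hypothetical prompt template; adjust to the target model's format.
    prompt = (
        "Analyze the following transaction narrative for fraud signals.\n"
        f"Archetype: {row['archetype']}\n"
        f"Language: {row['narrative_language']}\n"
        f"Narrative: {row['narrative_text']}"
    )
    label = "fraud" if row["is_fraud"] else "legitimate"
    return {"prompt": prompt, "completion": f"{trace}\nVerdict: {label}"}


def to_jsonl(rows):
    """Serialize all rows that have a trace as JSON Lines, one example per line."""
    examples = (to_finetune_example(row) for row in rows)
    # ensure_ascii=False keeps non-English narrative text readable in the file.
    return "\n".join(
        json.dumps(example, ensure_ascii=False)
        for example in examples
        if example is not None
    )


# Invented sample rows; the second has no trace and is skipped.
sample_rows = [
    {
        "archetype": "gig_worker",
        "narrative_language": "en",
        "narrative_text": "Caller claimed to be platform support and asked for a code.",
        "is_fraud": True,
        "reasoning_trace": "Step 1: unsolicited support call. Step 2: request for a one-time code.",
    },
    {
        "archetype": "unbanked",
        "narrative_language": "es",
        "narrative_text": "(narrative text)",
        "is_fraud": False,
        "reasoning_trace": None,
    },
]
print(to_jsonl(sample_rows))
```

Rows without a trace are dropped rather than padded with an empty completion, since the card states that traces exist only on sampled rows.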

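Derived and related work

Work related to this dataset centers on synthetic data generation and domain-adaptive models. For example, the Tab-DDPM tabular diffusion method is used to learn the joint distribution of fraud behavior features, going beyond traditional rule-based generation. The dataset is also commonly paired with mainstream fraud detection benchmarks such as IEEE-CIS to evaluate how well models generalize to unseen populations, motivating research on cross-domain transfer learning and few-shot fraud detection. In addition, its multilingual narrative text supports natural language processing for low-resource languages in finance, providing a data foundation for fraud narrative classification and generation models adapted to different cultural contexts.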

The above content was collected and summarized by 遇见数据集.
