Nachammai41/unbanked-fraud-narratives_with_reasoning
Source: Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Nachammai41/unbanked-fraud-narratives_with_reasoning
Resource description:
---
annotations_creators: []
language:
- en
language_creators: []
license: []
multilinguality:
- monolingual
pretty_name: 'unbanked_fraud_narratives'
size_categories:
- n<1K
source_datasets:
- 'original'
tags:
- adaption
- instruction-tuning
- writing-editing-communication
- personal-finance
task_categories: []
task_ids: []
---

This dataset is a remastered version prepared using [Adaption's](https://adaptionlabs.ai/app/auth) Adaptive Data platform.
# unbanked_fraud_narratives
This dataset contains prompts designed to generate first-person narratives from unbanked individuals involved in either fraudulent or legitimate financial transactions. Each prompt specifies details such as the fraud vector, financial instrument, transaction amount, sender age, and language to guide the creation of realistic scenarios. The content focuses on community contexts involving prepaid cards, payday loans, and kiosk fraud, aiming to capture the financial impact and emotional responses of the participants.
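As a rough illustration of how such a prompt might be assembled from the details listed above, here is a minimal sketch. The `NarrativeSpec` class, the `build_prompt` function, and the prompt wording are all hypothetical; the actual prompts in the dataset may be phrased and structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NarrativeSpec:
    """Hypothetical container for the details a prompt specifies."""
    fraud_vector: str          # e.g. "load-fee scam"; "none" for legitimate cases
    financial_instrument: str  # e.g. "prepaid card"
    amount_usd: float
    sender_age: int
    language: str              # e.g. "en"
    is_fraud: bool


def build_prompt(spec: NarrativeSpec) -> str:
    """Render one narrative-generation prompt (illustrative wording only)."""
    kind = "fraudulent" if spec.is_fraud else "legitimate"
    return (
        f"Write a first-person narrative in language '{spec.language}' from an "
        f"unbanked {spec.sender_age}-year-old involved in a {kind} transaction. "
        f"Instrument: {spec.financial_instrument}. "
        f"Amount: ${spec.amount_usd:,.2f}. "
        f"Fraud vector: {spec.fraud_vector}. "
        "Describe the financial impact and the narrator's emotional response."
    )


example = NarrativeSpec("load-fee scam", "prepaid card", 240.0, 52, "en", True)
prompt = build_prompt(example)
print(prompt)
```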
### Dataset size
There are 98 data points in this dataset. This is an instruction tuning dataset.
### Quality of Remastered Dataset
The final quality is A, with a relative quality improvement of 86.0%.
### Domain
- Writing-editing-communication (82%)
- Personal-finance (6%)
### Language
- English (100%)
### Tone
- Anecdotal (100%)
### Evaluation Results
- **Quality Gains:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/d02f2dc5-7879-4a95-bee8-82aecffc678a.png" alt="QualityGains" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Grade Improvement:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/5776c738-aae7-4f32-bb4a-d20a4c46f4af.png" alt="Grade" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Percentile Chart:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/39e01589-2064-4f4b-8d2f-a8f90375a613.png" alt="Percentile Chart" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
# Underserved Financial Fraud Dataset
### Synthetic fraud detection data for underrepresented communities
**Created with Adaptive Data by Adaption** | CC BY 4.0 | 5 languages
---
## What This Is
A synthetic financial fraud dataset covering **four underserved community archetypes** — populations that rely on remittance transfers, gig economy payouts, prepaid cards, and ITIN-based transactions. These communities are disproportionately targeted by fraud, yet no open-source fraud dataset has ever modeled their financial behavior.
This dataset fills that gap.
---
## The Four Archetypes
| Archetype | Who | Fraud Vectors | Languages |
|---|---|---|---|
| **Remittance Sender** | Immigrants sending money cross-border via Western Union, Remitly, MoneyGram | Emergency call scams, fake exchange rate bonuses, interception | es, ht, yo, hi, en |
| **Gig Worker** | Uber, DoorDash, Instacart workers paid via CashApp, Venmo | Account takeover, SIM swap, fake platform support calls | en, hi, vi, es, yo |
| **Unbanked Cash-In User** | Populations using prepaid cards and retail kiosks | Predatory micro-loans, load-fee scams, fake utility kiosks | en, es, vi, yo, hi |
| **ITIN Entrepreneur** | Immigrant small business owners with no SSN | Synthetic identity fraud, fake tax returns, mule accounts | en, es, hi, ta, vi |
## Languages
`en` English | `es` Spanish | `hi` Hinglish | `ht` Haitian Creole | `yo` Yoruba | `vi` Vietnamese | `ta` Tamil | `ta-en` Tamil-English
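The language legend and the per-archetype language lists can be cross-checked with a few lines of code. The sketch below copies both from this card; the variable names are illustrative and not part of the dataset.

```python
# Language legend as listed in this card (note: `hi` denotes Hinglish here).
LANGUAGES = {
    "en": "English", "es": "Spanish", "hi": "Hinglish", "ht": "Haitian Creole",
    "yo": "Yoruba", "vi": "Vietnamese", "ta": "Tamil", "ta-en": "Tamil-English",
}

# Languages per archetype, copied from the archetype table.
ARCHETYPE_LANGUAGES = {
    "Remittance Sender": ["es", "ht", "yo", "hi", "en"],
    "Gig Worker": ["en", "hi", "vi", "es", "yo"],
    "Unbanked Cash-In User": ["en", "es", "vi", "yo", "hi"],
    "ITIN Entrepreneur": ["en", "es", "hi", "ta", "vi"],
}

used = {code for codes in ARCHETYPE_LANGUAGES.values() for code in codes}
unknown = used - set(LANGUAGES)   # codes in the table but missing from the legend
unused = set(LANGUAGES) - used    # codes in the legend but not in the table

print("unknown:", sorted(unknown))
print("unused:", sorted(unused))
```

Each archetype lists five codes, while the legend defines eight; the check makes it easy to see which legend entries no archetype row references.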
---
## What Makes It Different
**No existing fraud dataset covers this population.**
PaySim simulates generic mobile money. Sparkov models middle-class credit cards. IEEE-CIS captures e-commerce. None covers remittance kiosks, gig payouts, or ITIN-linked accounts.
**Generated with diffusion, not rules.**
The tabular data is generated with Tab-DDPM (denoising diffusion for tabular data), which learns joint correlations across behavioral features rather than sampling each column independently. The model was trained on an A100 GPU via Google Colab Pro.
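To see why joint modeling matters, the toy sketch below contrasts paired rows with independently sampled columns. It does not implement Tab-DDPM; the `amounts` and `fees` columns and their relationship are made up purely for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)

# Toy "real" rows: the fee grows with the amount, plus a little noise.
amounts = [rng.uniform(20, 500) for _ in range(2000)]
fees = [0.05 * a + rng.gauss(0, 1.0) for a in amounts]

# Independent column sampling: each column keeps its own distribution,
# but the pairing between columns is lost.
indep_fees = fees[:]
rng.shuffle(indep_fees)

joint_corr = pearson(amounts, fees)
indep_corr = pearson(amounts, indep_fees)
print(f"joint: {joint_corr:.3f}  independent: {indep_corr:.3f}")
```

The shuffled column has the same marginal distribution as the original, yet its correlation with the amount collapses toward zero, which is the failure mode a joint generative model is meant to avoid.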
**Multilingual narrative text.**
Every fraud transaction has a `narrative_text` field — the scam message or fraud description in the community's language. Generated by Adaptive Data by Adaption. Quality score improved from E (5.0) to A (9.2–9.4).
**Reasoning traces.**
390 chain-of-thought fraud analysis examples — step-by-step investigator reasoning grounded in community-specific fraud signals. No existing fraud dataset includes this. Built for fine-tuning financial language models (FinBERT, Gemma).
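One plausible way to turn a row with a reasoning trace into a fine-tuning example is sketched below. The field names follow the schema in this card; the `to_training_example` function, the prompt wording, and the `prompt`/`completion` output keys are assumptions, not the format the dataset ships in.

```python
import json


def to_training_example(row):
    """Turn one row with a reasoning trace into a prompt/completion pair."""
    if not row.get("reasoning_trace"):
        return None  # traces exist only on sampled rows
    prompt = (
        "Analyze the following transaction narrative for fraud.\n"
        f"Archetype: {row['archetype']}\n"
        f"Language: {row['narrative_language']}\n"
        f"Narrative: {row['narrative_text']}"
    )
    verdict = "fraud" if row["is_fraud"] else "legitimate"
    completion = f"{row['reasoning_trace']}\nVerdict: {verdict}"
    return {"prompt": prompt, "completion": completion}


# Made-up row for illustration only.
row = {
    "archetype": "unbanked",
    "narrative_language": "en",
    "narrative_text": "The kiosk asked for an extra fee to load my card.",
    "is_fraud": True,
    "reasoning_trace": "Step 1: The load fee is not a standard charge.",
}
example = to_training_example(row)
jsonl_line = json.dumps(example, ensure_ascii=False)  # one line of a JSONL file
skipped = to_training_example({**row, "reasoning_trace": None})
```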
---
## How It Was Built
```
1. Scrape     1,040 real fraud narratives from CFPB, BBB Scam Tracker,
              and Reddit archive (Pullpush.io)
2. Profile    Behavioral distributions per archetype derived from
              scraped narratives: amounts, channels, corridors,
              fraud vectors, language mix
3. Generate   Tab-DDPM trains on 5,000 seed rows per archetype,
              learns joint feature correlations, generates 5,000
              synthetic transactions per archetype
4. Narrate    Adaptive Data by Adaption fills narrative_text in
              8 languages per transaction's fraud context
5. Trace      390 reasoning traces generated: chain-of-thought
              fraud analysis for fine-tuning use
```
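The profiling step (step 2) can be pictured as computing per-archetype frequency tables. The sketch below does this for a single categorical field; the `profile` function and the sample rows are hypothetical, and the actual profiling code is not part of this card.

```python
from collections import Counter, defaultdict


def profile(records, field):
    """Return {archetype: {value: relative frequency}} for one categorical field."""
    counts = defaultdict(Counter)
    for row in records:
        counts[row["archetype"]][row[field]] += 1
    return {
        archetype: {value: n / sum(c.values()) for value, n in c.items()}
        for archetype, c in counts.items()
    }


# Made-up rows standing in for features extracted from scraped narratives.
rows = [
    {"archetype": "unbanked", "channel": "retail_kiosk"},
    {"archetype": "unbanked", "channel": "retail_kiosk"},
    {"archetype": "unbanked", "channel": "mobile_app"},
    {"archetype": "gig_worker", "channel": "p2p"},
]
channel_profile = profile(rows, "channel")
print(channel_profile)
```

Distributions like these could then be used to draw the seed rows that the generative model trains on.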
---
## Schema (Key Fields)
| Field | Type | Description |
|---|---|---|
| `transaction_id` | uuid | Unique identifier |
| `archetype` | categorical | remittance / gig_worker / unbanked / itin |
| `amount_usd` | float | Transaction amount |
| `channel` | categorical | retail_kiosk / mobile_app / p2p / bank_wire |
| `fraud_vector` | categorical | Specific scam type |
| `is_fraud` | bool | Ground truth label |
| `fraud_confidence` | float | 0.0–1.0 label confidence |
| `narrative_text` | string | Scam description in community language |
| `narrative_language` | categorical | Language code as listed under Languages (ISO 639-1 codes, plus `ta-en` for Tamil-English) |
| `reasoning_trace` | string | Chain-of-thought fraud analysis (sampled rows) |
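
The key fields can be mirrored in a small record type with basic checks. This is a sketch only: it treats the archetype and channel values listed in the table as the complete set, which is an assumption, and the `Transaction` class and `problems` method are hypothetical helpers rather than anything shipped with the dataset.

```python
import uuid
from dataclasses import dataclass, replace
from typing import Optional

# Assumed to be exhaustive; the card lists these values but does not say so.
ARCHETYPES = {"remittance", "gig_worker", "unbanked", "itin"}
CHANNELS = {"retail_kiosk", "mobile_app", "p2p", "bank_wire"}


@dataclass
class Transaction:
    transaction_id: str
    archetype: str
    amount_usd: float
    channel: str
    fraud_vector: str
    is_fraud: bool
    fraud_confidence: float
    narrative_text: str
    narrative_language: str
    reasoning_trace: Optional[str] = None  # present on sampled rows only

    def problems(self) -> list:
        """Return a list of human-readable schema violations (empty if none)."""
        issues = []
        try:
            uuid.UUID(self.transaction_id)
        except ValueError:
            issues.append("transaction_id is not a valid UUID")
        if self.archetype not in ARCHETYPES:
            issues.append(f"unknown archetype: {self.archetype}")
        if self.channel not in CHANNELS:
            issues.append(f"unknown channel: {self.channel}")
        if not 0.0 <= self.fraud_confidence <= 1.0:
            issues.append("fraud_confidence outside 0.0-1.0")
        return issues


# Made-up record for illustration only.
good = Transaction(
    transaction_id=str(uuid.uuid4()), archetype="unbanked", amount_usd=120.0,
    channel="retail_kiosk", fraud_vector="load-fee scam", is_fraud=True,
    fraud_confidence=0.9, narrative_text="...", narrative_language="en",
)
bad = replace(good, archetype="student", fraud_confidence=1.5)
print(good.problems(), bad.problems())
```
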
## Intended Use
- Training fraud detection models on underserved community transaction patterns
- Benchmarking existing models (IEEE-CIS trained) against this population
- Fine-tuning financial language models on multilingual fraud narratives
- Research into AI fairness and financial inclusion
- NLP research on under-resourced financial language
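
For the benchmarking use case, a per-archetype ROC AUC gives a quick view of where an existing model struggles. The sketch below uses a plain pairwise AUC and a stand-in scoring function; `roc_auc`, `auc_by_archetype`, and the toy rows are illustrative, and a real benchmark would plug in the scores of the model under test.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """AUC as the probability that a fraud row outscores a legitimate row.

    Ties count as 0.5. Returns None when only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_archetype(rows, score_fn):
    """Group rows by archetype and compute the AUC of score_fn within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["archetype"]].append(row)
    return {
        name: roc_auc([r["is_fraud"] for r in g], [score_fn(r) for r in g])
        for name, g in groups.items()
    }


# Made-up rows; the "model" here simply scores by transaction amount.
rows = [
    {"archetype": "unbanked", "is_fraud": True, "amount_usd": 400.0},
    {"archetype": "unbanked", "is_fraud": False, "amount_usd": 50.0},
    {"archetype": "unbanked", "is_fraud": True, "amount_usd": 30.0},
    {"archetype": "unbanked", "is_fraud": False, "amount_usd": 20.0},
    {"archetype": "gig_worker", "is_fraud": True, "amount_usd": 100.0},
    {"archetype": "gig_worker", "is_fraud": False, "amount_usd": 100.0},
]
result = auc_by_archetype(rows, lambda r: r["amount_usd"])
print(result)
```

The pairwise form is quadratic in group size, which is fine for a sketch; a rank-based implementation would be preferable for the full 5,000 rows per archetype.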
---
## What This Is Not
This is a **fully synthetic** dataset. No real transaction data. No PII. Behavioral distributions are informed by public fraud narratives and World Bank remittance corridor data — not empirically measured transaction logs. Like all synthetic fraud datasets (PaySim, Sparkov, Cifer-AF), ground truth validation against real data is not possible due to privacy constraints.
---
## Origin
This dataset was created as part of the **Uncharted Data Challenge** by Adaption Labs (April 2026). It extends the [Fraud Detection Framework](https://github.com/nachammai779/Fraud-Detection-Framework---An-Agentic-RAG-Pipeline-with-Custom-Financial-SLM) — an Agentic RAG pipeline with a custom Financial SLM built on the IEEE-CIS dataset (AUC-ROC 0.9486). The underserved dataset enables direct benchmarking: how does a model trained on mainstream data perform on populations it has never seen?
---
## Citation
```bibtex
@dataset{palaniappan2026underserved,
  author    = {Palaniappan, Nachammai},
  title     = {Underserved Financial Fraud Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  note      = {Created with Adaptive Data by Adaption.
               Uncharted Data Challenge, Adaption Labs.},
  url       = {https://huggingface.co/datasets/nachammai779/underserved-financial-fraud}
}
```
---
## Credits
- **Adaptive Data by Adaption** — Narrative generation and dataset enrichment
- **Tab-DDPM** (Kotelnikov et al., 2022) — Tabular diffusion model
- **CFPB** — Consumer Financial Protection Bureau public complaint database
- **BBB Scam Tracker** — Better Business Bureau public scam reports
- **Pullpush.io** — Reddit archive API
---
*License: CC BY 4.0 — Free to use with attribution*
Provider:
Nachammai41
Collected summary
Dataset introduction

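Construction method
Data on underserved communities is scarce in financial fraud detection, and this dataset was built through a systematic pipeline to address that. First, 1,040 real fraud narratives were collected from public sources such as the Consumer Financial Protection Bureau and the Better Business Bureau, and used to profile the transaction behavior distributions of each community archetype. Next, the Tab-DDPM tabular diffusion model learned the joint correlations among features from 5,000 seed rows per archetype and generated 5,000 synthetic transactions for each archetype. Finally, Adaption's Adaptive Data platform filled in the narrative text of each transaction in multiple languages and produced 390 fraud analysis examples with step-by-step reasoning chains, yielding a fully structured instruction-tuning dataset.
Features
The dataset's core feature is its focus on unbanked populations that traditional financial fraud datasets often overlook, covering four community archetypes that include remittance senders and gig economy workers. It uses a diffusion model to generate synthetic data, which captures complex relationships among behavioral features rather than sampling each one independently. Each fraud transaction carries a multilingual narrative text field that describes the fraud scenario in the community's language, and sampled rows add a reasoning trace field with chain-of-thought fraud analysis. Together these let the dataset simulate the transaction patterns of specific populations and provide rich semantic material for fine-tuning financial language models.
Usage
The dataset is mainly intended for training fraud detection models for underserved communities. Researchers can use its multilingual narrative text and structured transaction features to develop classifiers that recognize specific fraud vectors. The reasoning traces can be used directly to fine-tune financial language models and improve step-by-step reasoning over complex fraud scenarios. The dataset can also serve as a benchmark for evaluating how models trained on mainstream data perform on populations they have not seen, supporting research on fairness and inclusion in financial AI. Use should follow the CC BY 4.0 license, and because the data is fully synthetic, results should be validated with domain knowledge.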
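Background and challenges
Background overview
At the intersection of fintech and artificial intelligence, fraud detection research for financially underserved groups has long been held back by scarce data. The unbanked-fraud-narratives_with_reasoning dataset was created in 2026 as part of the Uncharted Data Challenge run by Adaption Labs to fill this gap. It focuses on four representative groups (remittance senders, gig economy workers, unbanked cash-in users, and ITIN entrepreneurs) and simulates their financial transaction behavior with synthetic data. The core research question is how to build multilingual narrative data that reflects real fraud patterns in marginalized communities, in support of fair and inclusive financial AI models. The dataset offers a resource for financial inclusion research and for fairness evaluation of fraud detection algorithms, extending the field from mainstream data toward more diverse scenarios.
Current challenges
The dataset addresses fraud detection for financially underserved groups, who are often missed by mainstream fraud models because they rely on non-traditional instruments such as prepaid cards and remittance services. The first challenge in building it was the lack of real data: the work relies on public fraud narratives and reports, and uses the Tab-DDPM diffusion model to synthesize transactions with joint feature correlations instead of drawing on empirical transaction logs. Second, producing high-quality multilingual narrative text is a demanding language adaptation task, which has to capture the fraud context and cultural nuances of different communities across eight languages, including English, Spanish, and Yoruba. Third, to improve model interpretability, the dataset needs chain-of-thought reasoning traces that mimic an investigator's step-by-step analysis, which places high demands on combining narrative logic with domain knowledge.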
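Common scenarios
Typical use case
In financial fraud detection, the dataset offers researchers and developers a resource focused on first-person narratives of unbanked individuals involved in fraudulent or legitimate transactions. By covering community contexts such as prepaid cards, payday loans, and kiosk fraud, it supports generating realistic scenario descriptions that combine specific fraud vectors, financial instruments, and emotional responses. The typical use case is instruction tuning on these narratives to improve how large language models understand the financial behavior of marginalized groups, supporting more accurate text generation and analysis.
Academic problems addressed
The dataset targets a long-standing representation gap in fintech research: traditional fraud detection models are usually built on data from mainstream populations and miss the transaction patterns specific to unbanked communities. By synthesizing data for four archetypes, including remittance senders and gig economy workers, it fills a gap in open resources for modeling fraud against vulnerable groups. This supports research on AI fairness and financial inclusion, letting researchers examine performance bias on unseen populations and work toward fairer algorithm design.
Related and derived work
The dataset extends the Fraud Detection Framework, an Agentic RAG pipeline with a custom financial small language model built on the IEEE-CIS dataset, where it reached an AUC-ROC of 0.9486. The chain-of-thought reasoning traces are built for fine-tuning models such as FinBERT and Gemma to improve their understanding of multilingual fraud narratives. Together, this work broadens the reach of the fraud detection framework and offers a methodological basis for fair AI and financial language processing.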
The above content was collected and summarized by 遇见数据集.



