Nachammai41/gig-worker-fraud-narratives_with_reasoning
Source: Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/Nachammai41/gig-worker-fraud-narratives_with_reasoning
Resource description:
---
annotations_creators: []
language:
- en
language_creators: []
license: []
multilinguality:
- monolingual
pretty_name: 'gig_worker_fraud_narratives'
size_categories:
- n<1K
source_datasets:
- 'original'
tags:
- adaption
- instruction-tuning
- writing-editing-communication
task_categories: []
task_ids: []
---

This dataset is a remastered version prepared using [Adaption's](https://adaptionlabs.ai/app/auth) Adaptive Data platform.
# gig_worker_fraud_narratives
This dataset contains prompts designed to generate first-person narratives from gig economy workers who have experienced various types of financial fraud, such as account takeovers, hacking, and social engineering. Each prompt specifies details including the fraud vector, financial instrument involved, transaction amount, and sender age to guide the creation of realistic scam scenarios. The intended completions describe the incident, financial impact, and emotional response, though the provided samples currently show null completions.
### Dataset size
The dataset contains 100 data points and is intended for instruction tuning.
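
Because the provided samples are reported to have null completions, it may be worth counting how many records are still missing a completion before any fine-tuning run. The sketch below assumes each record is a dict with `prompt` and `completion` keys; those field names are assumptions and are not confirmed by this card. In practice the records would come from `load_dataset` in the Hugging Face `datasets` library, while the in-memory sample here is purely illustrative.

```python
from typing import Iterable, Optional

Record = dict[str, Optional[str]]


def split_by_completion(records: Iterable[Record]) -> tuple[list[Record], list[Record]]:
    """Separate records with a usable completion from those without one."""
    complete, missing = [], []
    for rec in records:
        text = rec.get("completion")  # assumed field name
        if text is not None and text.strip():
            complete.append(rec)
        else:
            # None, empty, or whitespace-only completions count as missing.
            missing.append(rec)
    return complete, missing


# Hypothetical rows standing in for the real dataset.
sample = [
    {"prompt": "Narrate a SIM swap incident.", "completion": None},
    {"prompt": "Narrate an account takeover.", "completion": "I drive for a delivery app and one morning my payout was gone."},
    {"prompt": "Narrate a fake support call.", "completion": "   "},
]

complete, missing = split_by_completion(sample)
print(f"{len(complete)} usable, {len(missing)} missing")
```

Treating whitespace-only strings as missing is a design choice of this sketch, made so that placeholder completions are not mistaken for real training targets.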
### Quality of Remastered Dataset
The final quality is A, with a relative quality improvement of 84.0%.
### Domain
- Writing-editing-communication (86%)
### Language
- English (100%)
### Tone
- Anecdotal (94%)
### Evaluation Results
- **Quality Gains:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/a402c48e-0c1c-4565-9f4c-2f29b1d8db4f.png" alt="QualityGains" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Grade Improvement:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/5ebc228a-d844-4008-b602-c9579d7ef00c.png" alt="Grade" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Percentile Chart:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/aad30e99-878c-46c9-9e79-3df0915161ec.png" alt="Percentile Chart" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
# Underserved Financial Fraud Dataset
### Synthetic fraud detection data for underrepresented communities
**Created with Adaptive Data by Adaption** | CC BY 4.0 | 5 languages per archetype (8 language codes overall)
---
## What This Is
A synthetic financial fraud dataset covering **four underserved community archetypes** — populations that rely on remittance transfers, gig economy payouts, prepaid cards, and ITIN-based transactions. These communities are disproportionately targeted by fraud, yet no open-source fraud dataset has ever modeled their financial behavior.
This dataset fills that gap.
---
## The Four Archetypes
| Archetype | Who | Fraud Vectors | Languages |
|---|---|---|---|
| **Remittance Sender** | Immigrants sending money cross-border via Western Union, Remitly, MoneyGram | Emergency call scams, fake exchange rate bonuses, interception | es, ht, yo, hi, en |
| **Gig Worker** | Uber, DoorDash, Instacart workers paid via CashApp, Venmo | Account takeover, SIM swap, fake platform support calls | en, hi, vi, es, yo |
| **Unbanked Cash-In User** | Populations using prepaid cards and retail kiosks | Predatory micro-loans, load-fee scams, fake utility kiosks | en, es, vi, yo, hi |
| **ITIN Entrepreneur** | Immigrant small business owners with no SSN | Synthetic identity fraud, fake tax returns, mule accounts | en, es, hi, ta, vi |
## Languages
`en` English | `es` Spanish | `hi` Hinglish | `ht` Haitian Creole | `yo` Yoruba | `vi` Vietnamese | `ta` Tamil | `ta-en` Tamil-English
---
## What Makes It Different
**No existing fraud dataset covers this population.**
PaySim simulates generic mobile money. Sparkov models middle-class credit cards. IEEE-CIS captures e-commerce. None model remittance kiosks, gig payouts, or ITIN-linked accounts.
**Generated with diffusion, not rules.**
The tabular data is generated with Tab-DDPM (denoising diffusion for tabular data), which learns joint correlations across behavioral features rather than sampling each column independently. The model was trained on an A100 GPU via Google Colab Pro.
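
To see why joint modeling matters, the toy sketch below contrasts independent column sampling with row-level resampling. It does not use Tab-DDPM: row-level bootstrap is only a simple stand-in for a generator that preserves joint structure. The `hour` feature and the correlation between amount and hour are invented for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)

# Toy "seed" data: a hypothetical hour feature that rises with the amount.
amount = [rng.uniform(20, 500) for _ in range(2000)]
hour = [a / 500 * 12 + rng.gauss(0, 1.5) for a in amount]

# Independent column sampling: each column is resampled on its own,
# so the link between amount and hour is lost.
ind_amount = rng.choices(amount, k=2000)
ind_hour = rng.choices(hour, k=2000)

# Joint stand-in: whole rows are resampled, so the link is kept.
rows = rng.choices(list(zip(amount, hour)), k=2000)
joint_amount, joint_hour = zip(*rows)

corr_seed = pearson(amount, hour)
corr_independent = pearson(ind_amount, ind_hour)
corr_joint = pearson(joint_amount, joint_hour)
print(f"seed={corr_seed:.2f} independent={corr_independent:.2f} joint={corr_joint:.2f}")
```

The independent sample keeps each column's marginal distribution but loses the relationship between columns, which is the failure mode a joint generator is meant to avoid.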
**Multilingual narrative text.**
Every fraud transaction has a `narrative_text` field containing the scam message or fraud description in the community's language. The narratives were generated with Adaptive Data by Adaption, and their quality score improved from E (5.0) to A (9.2–9.4).
**Reasoning traces.**
The dataset includes 390 chain-of-thought fraud analysis examples: step-by-step investigator reasoning grounded in community-specific fraud signals. No existing fraud dataset includes this. The traces are built for fine-tuning financial language models such as FinBERT and Gemma.
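
One possible way to prepare the reasoning traces for fine-tuning is to turn each traced row into a prompt/completion pair and write the pairs as JSON Lines. The sketch below uses the schema field names from this card, but the prompt wording, the `Verdict:` suffix, and the output format are assumptions rather than the author's actual recipe.

```python
import json


def to_sft_example(row):
    """Turn one row with a reasoning trace into a prompt/completion pair."""
    prompt = (
        "Analyze the following transaction narrative for signs of fraud.\n"
        f"Archetype: {row['archetype']}\n"
        f"Language: {row['narrative_language']}\n"
        f"Narrative: {row['narrative_text']}"
    )
    verdict = "fraud" if row["is_fraud"] else "legitimate"
    completion = f"{row['reasoning_trace']}\nVerdict: {verdict}"
    return {"prompt": prompt, "completion": completion}


def to_jsonl(rows):
    """Serialize only rows that carry a reasoning trace, one JSON object per line."""
    return "\n".join(
        json.dumps(to_sft_example(r), ensure_ascii=False)
        for r in rows
        if r.get("reasoning_trace")
    )


# Hypothetical rows; only the first has a reasoning trace.
demo_rows = [
    {
        "archetype": "gig_worker",
        "narrative_language": "en",
        "narrative_text": "Support called and asked for my login code.",
        "is_fraud": True,
        "reasoning_trace": "Step 1: The platform never asks for codes by phone.",
    },
    {
        "archetype": "remittance",
        "narrative_language": "es",
        "narrative_text": "Envio mensual a mi familia.",
        "is_fraud": False,
        "reasoning_trace": None,
    },
]

jsonl = to_jsonl(demo_rows)
print(jsonl)
```

`ensure_ascii=False` keeps non-English narratives readable in the output file instead of escaping them.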
---
## How It Was Built
```
1. Scrape     1,040 real fraud narratives from CFPB, BBB Scam Tracker,
              and Reddit archive (Pullpush.io)
2. Profile    Behavioral distributions per archetype derived from
              scraped narratives: amounts, channels, corridors,
              fraud vectors, language mix
3. Generate   Tab-DDPM trains on 5,000 seed rows per archetype,
              learns joint feature correlations, generates 5,000
              synthetic transactions per archetype
4. Narrate    Adaptive Data by Adaption fills narrative_text in
              8 languages per transaction's fraud context
5. Trace      390 reasoning traces generated: chain-of-thought
              fraud analysis for fine-tuning use
```
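
The Profile step can be pictured as computing per-archetype frequency tables from labeled narratives. The sketch below is a minimal illustration under that assumption. The `profile` helper, the record layout, and the `fraud_vector` label strings are hypothetical, and the card does not describe how the author actually implemented this step.

```python
from collections import Counter, defaultdict


def profile(narratives, field):
    """Return {archetype: {value: relative frequency}} for one categorical field."""
    counts = defaultdict(Counter)
    for row in narratives:
        counts[row["archetype"]][row[field]] += 1
    return {
        arch: {value: n / sum(c.values()) for value, n in c.items()}
        for arch, c in counts.items()
    }


# Hypothetical labeled narratives standing in for the scraped corpus.
scraped = [
    {"archetype": "gig_worker", "fraud_vector": "sim_swap", "channel": "mobile_app"},
    {"archetype": "gig_worker", "fraud_vector": "account_takeover", "channel": "p2p"},
    {"archetype": "gig_worker", "fraud_vector": "account_takeover", "channel": "p2p"},
    {"archetype": "remittance", "fraud_vector": "emergency_call", "channel": "retail_kiosk"},
]

dist = profile(scraped, "fraud_vector")
for archetype, freqs in dist.items():
    print(archetype, freqs)
```

Frequency tables like these could then be used to draw seed rows for the Generate step, although the exact hand-off between steps is not specified in this card.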
---
## Schema (Key Fields)
| Field | Type | Description |
|---|---|---|
| `transaction_id` | uuid | Unique identifier |
| `archetype` | categorical | remittance / gig_worker / unbanked / itin |
| `amount_usd` | float | Transaction amount |
| `channel` | categorical | retail_kiosk / mobile_app / p2p / bank_wire |
| `fraud_vector` | categorical | Specific scam type |
| `is_fraud` | bool | Ground truth label |
| `fraud_confidence` | float | 0.0–1.0 label confidence |
| `narrative_text` | string | Scam description in community language |
| `narrative_language` | categorical | Language code (ISO 639-1 style, e.g. `en`, `es`; `ta-en` marks Tamil-English) |
| `reasoning_trace` | string | Chain-of-thought fraud analysis (sampled rows) |
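
The schema above can be mirrored in a small record type with basic checks, which may help catch malformed rows early. This is a sketch only. The field names and allowed category values come from the table, while the `Transaction` class, the `problems` method, and the rule that amounts must be non-negative are assumptions.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

ARCHETYPES = {"remittance", "gig_worker", "unbanked", "itin"}
CHANNELS = {"retail_kiosk", "mobile_app", "p2p", "bank_wire"}


@dataclass
class Transaction:
    transaction_id: str
    archetype: str
    amount_usd: float
    channel: str
    fraud_vector: str
    is_fraud: bool
    fraud_confidence: float
    narrative_text: str
    narrative_language: str
    reasoning_trace: Optional[str] = None  # present on sampled rows only

    def problems(self) -> list[str]:
        """Return human-readable schema violations (empty list if none)."""
        issues = []
        try:
            uuid.UUID(self.transaction_id)
        except ValueError:
            issues.append("transaction_id is not a valid UUID")
        if self.archetype not in ARCHETYPES:
            issues.append(f"unknown archetype: {self.archetype}")
        if self.channel not in CHANNELS:
            issues.append(f"unknown channel: {self.channel}")
        if self.amount_usd < 0:  # assumed rule, not stated in the card
            issues.append("amount_usd is negative")
        if not 0.0 <= self.fraud_confidence <= 1.0:
            issues.append("fraud_confidence outside 0.0-1.0")
        return issues


good = Transaction(
    "123e4567-e89b-12d3-a456-426614174000", "gig_worker", 120.0, "p2p",
    "sim_swap", True, 0.87, "Someone ported my number overnight.", "en",
)
bad = Transaction(
    "not-a-uuid", "student", 50.0, "p2p",
    "sim_swap", False, 1.4, "Example text.", "en",
)
print(good.problems())
print(bad.problems())
```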
## Intended Use
- Training fraud detection models on underserved community transaction patterns
- Benchmarking existing models (for example, models trained on IEEE-CIS) against this population
- Fine-tuning financial language models on multilingual fraud narratives
- Research into AI fairness and financial inclusion
- NLP research on under-resourced financial language
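
For the benchmarking use case, one simple approach is to report AUC-ROC separately for each archetype so that weak performance on one population is not hidden by an aggregate score. The sketch below computes AUC by pairwise comparison in pure Python. The `score_fn` shown is a hypothetical stand-in for a real model, and the rows are invented.

```python
from collections import defaultdict


def auc_roc(labels, scores):
    """AUC as P(score of a fraud row > score of a legitimate row); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return None  # undefined when one class is absent
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_archetype(rows, score_fn):
    """Group rows by archetype and compute AUC-ROC within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["archetype"]].append((row["is_fraud"], score_fn(row)))
    return {arch: auc_roc(*zip(*pairs)) for arch, pairs in groups.items()}


# Hypothetical scorer: larger amounts are treated as more suspicious.
def score_fn(row):
    return min(row["amount_usd"] / 1000, 1.0)


rows = [
    {"archetype": "gig_worker", "is_fraud": True, "amount_usd": 900},
    {"archetype": "gig_worker", "is_fraud": False, "amount_usd": 50},
    {"archetype": "gig_worker", "is_fraud": False, "amount_usd": 200},
    {"archetype": "gig_worker", "is_fraud": True, "amount_usd": 100},
    {"archetype": "remittance", "is_fraud": True, "amount_usd": 800},
    {"archetype": "remittance", "is_fraud": False, "amount_usd": 100},
]

result = auc_by_archetype(rows, score_fn)
print(result)
```

The pairwise computation is quadratic in group size, which is fine for a sketch; a rank-based formula would be preferable for large evaluation sets.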
---
## What This Is Not
This is a **fully synthetic** dataset. No real transaction data. No PII. Behavioral distributions are informed by public fraud narratives and World Bank remittance corridor data — not empirically measured transaction logs. Like all synthetic fraud datasets (PaySim, Sparkov, Cifer-AF), ground truth validation against real data is not possible due to privacy constraints.
---
## Origin
This dataset was created as part of the **Uncharted Data Challenge** by Adaption Labs (April 2026). It extends the [Fraud Detection Framework](https://github.com/nachammai779/Fraud-Detection-Framework---An-Agentic-RAG-Pipeline-with-Custom-Financial-SLM) — an Agentic RAG pipeline with a custom Financial SLM built on the IEEE-CIS dataset (AUC-ROC 0.9486). The underserved dataset enables direct benchmarking: how does a model trained on mainstream data perform on populations it has never seen?
---
## Citation
```bibtex
@dataset{palaniappan2026underserved,
author = {Palaniappan, Nachammai},
title = {Underserved Financial Fraud Dataset},
year = {2026},
publisher = {HuggingFace},
note = {Created with Adaptive Data by Adaption.
Uncharted Data Challenge, Adaption Labs.},
url = {https://huggingface.co/datasets/nachammai779/underserved-financial-fraud}
}
```
---
## Credits
- **Adaptive Data by Adaption** — Narrative generation and dataset enrichment
- **Tab-DDPM** (Kotelnikov et al., 2022) — Tabular diffusion model
- **CFPB** — Consumer Financial Protection Bureau public complaint database
- **BBB Scam Tracker** — Better Business Bureau public scam reports
- **Pullpush.io** — Reddit archive API
---
*License: CC BY 4.0 — Free to use with attribution*
Provider:
Nachammai41
Collected summary
Dataset introduction

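Construction method
Data scarcity for underserved groups is an increasingly prominent problem in financial fraud detection. This dataset was built through a systematic pipeline. First, 1,040 real fraud narratives were collected from public sources such as the Consumer Financial Protection Bureau and the Better Business Bureau Scam Tracker, and used to profile the behavioral distributions of different groups (for example, remittance senders and gig economy workers). Next, the diffusion-based Tab-DDPM method learned the joint correlations among behavioral features and generated 5,000 synthetic transactions per group. Finally, Adaption's Adaptive Data platform generated multilingual fraud narrative text for each transaction, accompanied by 390 chain-of-thought reasoning traces, yielding a structurally complete and information-rich synthetic dataset.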
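Usage
The dataset is mainly intended to improve the fairness and generalization of financial fraud detection models. Researchers can use it to train or fine-tune fraud detection models, in particular to model the transaction patterns of underserved groups and close the performance gap that existing models show on such data. The multilingual narrative text and reasoning traces are also well suited to fine-tuning financial-domain language models such as FinBERT and Gemma, strengthening their ability to understand and analyze complex, multilingual fraud scenarios. In addition, the dataset can serve as a benchmark for evaluating how models trained on mainstream data perform on populations they have not seen, supporting research on financial inclusion and fairness in AI.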
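Background and challenges
Background overview
As digital finance and the gig economy become deeply intertwined, financial fraud targeting vulnerable groups is an increasingly prominent problem, yet research has long lacked high-quality, targeted data resources. The gig-worker-fraud-narratives_with_reasoning dataset was created in 2026 as part of Adaption Labs' Uncharted Data Challenge to fill this gap. It focuses on financial fraud scenarios experienced by gig economy workers and uses synthetic data generation to simulate first-person narratives covering fraud vectors such as account takeover, hacking, and social engineering. Its core research question is how to provide fraud detection models with narrative text and reasoning traces that span multiple languages and cultural contexts, so as to improve generalization and fairness in real-world settings. The dataset offers a new benchmark for natural language processing research in fintech and supports the exploration of AI for financial inclusion and ethical fairness.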
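Current challenges
The dataset addresses a key challenge in financial fraud detection: how to identify and model fraud targeting vulnerable groups such as gig economy workers, whose transaction patterns differ markedly from mainstream financial data and on which traditional models often perform poorly. The challenges in building it lie mainly in data synthesis and quality assurance. First, because privacy constraints rule out real transaction data, the dataset must rely entirely on synthetic generation, which calls for an advanced diffusion model (such as Tab-DDPM) that learns the joint distribution of behavioral features rather than sampling from simple rules, so that the synthetic data is statistically realistic and preserves correlations. Second, generating multilingual, culturally appropriate fraud narratives with step-by-step chain-of-thought annotations requires overcoming difficulties in domain adaptation and logical consistency in natural language generation. Finally, accurately characterizing the behavioral distributions of different communities from limited public narrative material (such as consumer complaints and scam reports) is another significant challenge.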
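Common scenarios
Classic use cases
In financial fraud detection, the dataset provides key material for studying the narrative patterns of gig economy workers who experience financial fraud. Covering fraud vectors such as account takeover, hacking, and social engineering, it guides the generation of detailed first-person accounts that describe not only the fraud incident itself but also the financial impact and emotional response, giving natural language processing models structured training data for generating and understanding text in fraud contexts.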
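Academic problems addressed
The dataset targets groups that are underrepresented in financial fraud research, such as communities that rely on remittances, gig economy payouts, prepaid cards, and ITIN-based transactions, filling the gap left by existing open-source data in modeling their financial behavior. Through synthetic data generation and chain-of-thought reasoning traces, it addresses academic problems including cross-group generalization of fraud detection models, multilingual financial text understanding, and AI fairness evaluation, advancing research on financial inclusion.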
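Practical applications
In practice, the dataset can be used to train and fine-tune financial fraud detection models, particularly for recognizing fraud patterns that target marginalized groups such as gig workers. Financial institutions and technology companies can use its multilingual narrative text and reasoning traces to develop more accurate monitoring systems and improve early warning of new fraud techniques, while promoting the safety and accessibility of financial services in diverse communities.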
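Recent research on the dataset
Latest research directions
At the intersection of fintech and artificial intelligence, fraud detection for marginalized groups such as gig economy workers is becoming an active research frontier. Using synthetic data generation, in particular the Tab-DDPM diffusion model, the dataset simulates multilingual narrative text and reasoning traces for complex fraud scenarios such as account takeover and social engineering. Its core contribution is filling the gap in mainstream fraud datasets regarding the behavior patterns of underserved communities such as gig workers and remittance senders, providing key corpora for training fair and inclusive financial language models. Current research focuses on using such data to optimize Agentic RAG pipelines and custom financial SLMs, with the aim of improving generalization and interpretability on cross-cultural, multilingual fraud narratives and advancing AI for financial inclusion.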
The content above was collected and summarized by 遇见数据集.



