Nachammai41/gig-worker-fraud-narratives

Name: Nachammai41/gig-worker-fraud-narratives
Creator: Nachammai41
Published: 2026-04-11 00:02:02
License: No description available

Hugging Face, updated 2026-04-11, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Nachammai41/gig-worker-fraud-narratives

Resource overview:

---
annotations_creators: []
language:
- en
language_creators: []
license: []
multilinguality:
- monolingual
pretty_name: 'gig_worker_fraud_narratives'
size_categories:
- 1K<n<10K
source_datasets:
- 'original'
tags:
- adaption
- instruction-tuning
- writing-editing-communication
task_categories: []
task_ids: []
---

![banner](https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/624ed9a8-b7ba-4463-874c-b22ce2524de8.png)

This dataset is a remastered version prepared using [Adaption's](https://adaptionlabs.ai/app/auth) Adaptive Data platform.

# gig_worker_fraud_narratives

This dataset contains prompts designed to generate first-person narratives from gig economy workers who have experienced various types of financial fraud, such as account takeover, hacking, and social engineering. Each prompt specifies details like the fraud vector, financial instrument, transaction amount, and sender age to guide the creation of realistic scam scenarios. The intended output focuses on the mechanics of the fraud, its financial impact, and the victim's emotional response within the context of the gig economy.

### Dataset size

There are 4,533 data points in this dataset. This is an instruction tuning dataset.

### Quality of Remastered Dataset

The final quality is A, with a relative quality improvement of 86.0%.

### Domain

- Writing-editing-communication (100%)

### Language

- English (100%)

### Tone

- Anecdotal (100%)

### Evaluation Results

- **Quality Gains:**

  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/045d20e6-bcd8-4c74-b767-8016f2b7eca6.png" alt="QualityGains" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />

- **Grade Improvement:**

  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/432ea378-3a57-4bf1-b56d-8f670cb69a9e.png" alt="Grade" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />

- **Percentile Chart:**

  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/59494a8f-07cc-4324-b95a-1bff37e6c532.png" alt="Percentile Chart" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />

# Underserved Financial Fraud Dataset

### Synthetic fraud detection data for underrepresented communities

**Created with Adaptive Data by Adaption** | CC BY 4.0 | 5 languages

---

## What This Is

A synthetic financial fraud dataset covering **four underserved community archetypes**: populations that rely on remittance transfers, gig economy payouts, prepaid cards, and ITIN-based transactions. These communities are disproportionately targeted by fraud, yet no open-source fraud dataset has ever modeled their financial behavior. This dataset fills that gap.

---

## The Four Archetypes

| Archetype | Who | Fraud Vectors | Languages |
|---|---|---|---|
| **Remittance Sender** | Immigrants sending money cross-border via Western Union, Remitly, MoneyGram | Emergency call scams, fake exchange rate bonuses, interception | es, ht, yo, hi, en |
| **Gig Worker** | Uber, DoorDash, Instacart workers paid via CashApp, Venmo | Account takeover, SIM swap, fake platform support calls | en, hi, vi, es, yo |
| **Unbanked Cash-In User** | Populations using prepaid cards and retail kiosks | Predatory micro-loans, load-fee scams, fake utility kiosks | en, es, vi, yo, hi |
| **ITIN Entrepreneur** | Immigrant small business owners with no SSN | Synthetic identity fraud, fake tax returns, mule accounts | en, es, hi, ta, vi |

## Languages

`en` English  |  `es` Spanish  |  `hi` Hinglish  |  `ht` Haitian Creole  |  `yo` Yoruba  |  `vi` Vietnamese  |  `ta` Tamil  |  `ta-en` Tamil-English

---

## What Makes It Different

**No existing fraud dataset covers this population.** PaySim simulates generic mobile money. Sparkov models middle-class credit cards. IEEE-CIS captures e-commerce. None covers remittance kiosks, gig payouts, or ITIN-linked accounts.

**Generated with diffusion, not rules.** The tabular data was generated with Tab-DDPM (denoising diffusion for tabular data), which learns joint correlations across behavioral features rather than sampling each column independently. It was trained on an A100 GPU via Google Colab Pro.

**Multilingual narrative text.** Every fraud transaction has a `narrative_text` field: the scam message or fraud description in the community's language. Generated by Adaptive Data by Adaption. Quality score improved from E (5.0) to A (9.2–9.4).

**Reasoning traces.** 390 chain-of-thought fraud analysis examples: step-by-step investigator reasoning grounded in community-specific fraud signals. No existing fraud dataset includes this. Built for fine-tuning financial language models (FinBERT, Gemma).

---

## How It Was Built

```
1. Scrape    1,040 real fraud narratives from CFPB, BBB Scam Tracker, and Reddit archive (Pullpush.io)
2. Profile   Behavioral distributions per archetype derived from scraped narratives: amounts, channels, corridors, fraud vectors, language mix
3. Generate  Tab-DDPM trains on 5,000 seed rows per archetype, learns joint feature correlations, generates 5,000 synthetic transactions per archetype
4. Narrate   Adaptive Data by Adaption fills narrative_text in 8 languages per transaction's fraud context
5. Trace     390 reasoning traces generated: chain-of-thought fraud analysis for fine-tuning use
```

---

## Schema (Key Fields)

| Field | Type | Description |
|---|---|---|
| `transaction_id` | uuid | Unique identifier |
| `archetype` | categorical | remittance / gig_worker / unbanked / itin |
| `amount_usd` | float | Transaction amount |
| `channel` | categorical | retail_kiosk / mobile_app / p2p / bank_wire |
| `fraud_vector` | categorical | Specific scam type |
| `is_fraud` | bool | Ground truth label |
| `fraud_confidence` | float | 0.0–1.0 label confidence |
| `narrative_text` | string | Scam description in community language |
| `narrative_language` | categorical | ISO 639-1 language code |
| `reasoning_trace` | string | Chain-of-thought fraud analysis (sampled rows) |

## Intended Use

- Training fraud detection models on underserved community transaction patterns
- Benchmarking existing models (IEEE-CIS trained) against this population
- Fine-tuning financial language models on multilingual fraud narratives
- Research into AI fairness and financial inclusion
- NLP research on under-resourced financial language

---

## What This Is Not

This is a **fully synthetic** dataset. No real transaction data. No PII. Behavioral distributions are informed by public fraud narratives and World Bank remittance corridor data, not by empirically measured transaction logs.
Like all synthetic fraud datasets (PaySim, Sparkov, Cifer-AF), ground truth validation against real data is not possible due to privacy constraints.

---

## Origin

This dataset was created as part of the **Uncharted Data Challenge** by Adaption Labs (April 2026). It extends the [Fraud Detection Framework](https://github.com/nachammai779/Fraud-Detection-Framework---An-Agentic-RAG-Pipeline-with-Custom-Financial-SLM), an Agentic RAG pipeline with a custom Financial SLM built on the IEEE-CIS dataset (AUC-ROC 0.9486). The underserved dataset enables direct benchmarking: how does a model trained on mainstream data perform on populations it has never seen?

---

## Citation

```bibtex
@dataset{palaniappan2026underserved,
  author    = {Palaniappan, Nachammai},
  title     = {Underserved Financial Fraud Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  note      = {Created with Adaptive Data by Adaption. Uncharted Data Challenge, Adaption Labs.},
  url       = {https://huggingface.co/datasets/nachammai779/underserved-financial-fraud}
}
```

---

## Credits

- **Adaptive Data by Adaption**: Narrative generation and dataset enrichment
- **Tab-DDPM** (Kotelnikov et al., 2022): Tabular diffusion model
- **CFPB**: Consumer Financial Protection Bureau public complaint database
- **BBB Scam Tracker**: Better Business Bureau public scam reports
- **Pullpush.io**: Reddit archive API

---

*License: CC BY 4.0. Free to use with attribution.*
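
The fields listed in the Schema (Key Fields) table can be sketched as a typed record with a few sanity checks. This is a minimal illustration, not the dataset's own loader: the class name `FraudTransaction` and the helper `validate` are hypothetical, and only the field names, the archetype values, the channel values, and the 0.0–1.0 confidence range come from the table.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional

# Allowed values copied from the schema table.
ARCHETYPES = {"remittance", "gig_worker", "unbanked", "itin"}
CHANNELS = {"retail_kiosk", "mobile_app", "p2p", "bank_wire"}


@dataclass
class FraudTransaction:
    """Hypothetical record type mirroring the key fields of the schema."""
    transaction_id: str
    archetype: str
    amount_usd: float
    channel: str
    fraud_vector: str
    is_fraud: bool
    fraud_confidence: float
    narrative_text: str
    narrative_language: str
    reasoning_trace: Optional[str] = None  # only present on sampled rows


def validate(row: FraudTransaction) -> List[str]:
    """Return a list of human-readable problems; empty means the row passes."""
    problems = []
    try:
        uuid.UUID(row.transaction_id)
    except ValueError:
        problems.append("transaction_id is not a valid UUID")
    if row.archetype not in ARCHETYPES:
        problems.append(f"unknown archetype: {row.archetype}")
    if row.channel not in CHANNELS:
        problems.append(f"unknown channel: {row.channel}")
    if not 0.0 <= row.fraud_confidence <= 1.0:
        problems.append("fraud_confidence outside 0.0-1.0")
    return problems
```

Running `validate` over loaded rows before training is one way to catch schema drift early; the checks deliberately stop at what the table states and do not guess at valid `fraud_vector` values.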

Provider:

Nachammai41

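Summary

Dataset introduction

Construction method

In financial fraud detection, data on marginalized groups has long been scarce. This dataset was built through a systematic pipeline. First, 1,040 real fraud narratives were extracted from the Consumer Financial Protection Bureau (CFPB), the Better Business Bureau (BBB) Scam Tracker, and a Reddit archive. Behavioral distributions were then derived from these narratives for four representative groups, including gig economy workers, covering transaction amounts, channels, and fraud vectors. A Tab-DDPM tabular diffusion model, trained on an A100 GPU, learned the joint correlations among features and generated 5,000 synthetic transaction records per group. Finally, Adaption's Adaptive Data platform filled in multilingual narrative text for each transaction, and 390 chain-of-thought reasoning traces were added, yielding a structurally complete instruction-tuning dataset.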

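Features

The dataset's core value lies in its focus on marginalized groups long overlooked by mainstream financial data, such as people who rely on gig economy income, remittance services, or prepaid cards. Its distinguishing feature is the combination of multilingual narrative text with chain-of-thought reasoning traces: each fraud transaction includes a scam description in the community's language, and step-by-step fraud analysis examples are provided. The data is generated with a diffusion model rather than a rule engine, so it learns joint correlations among behavioral features instead of sampling each column independently. The dataset also labels key fields such as fraud confidence and transaction channel, providing fine-grained supervision signals for model training. These properties make it a useful resource for research on financial inclusion and algorithmic fairness.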

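Usage

The dataset mainly serves research and practice in fintech and natural language processing. Users can train fraud detection models on the transaction patterns of specific groups, and measure how models trained on mainstream data perform on marginalized groups. The multilingual narrative text is suited to fine-tuning financial language models to improve their handling of non-standard financial contexts. The chain-of-thought reasoning traces provide instruction-tuning material for building explainable fraud analysis systems. In practice, it is advisable to combine transaction features, narrative text, and reasoning traces for multimodal learning, and to keep in mind that the data is fully synthetic: a model's ability to generalize should be validated carefully in real-world settings.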
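
The benchmarking use described above, checking whether a detector performs differently across groups, can be sketched as a per-archetype recall computation. This is a minimal sketch under stated assumptions: rows are plain dictionaries carrying the `archetype` and `is_fraud` fields from the schema, and `predict` is a hypothetical stand-in for whatever model is being evaluated.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Mapping


def recall_by_archetype(
    rows: Iterable[Mapping],
    predict: Callable[[Mapping], bool],
) -> Dict[str, float]:
    """Fraction of true fraud rows flagged by `predict`, per archetype."""
    frauds = defaultdict(int)  # true fraud rows seen per archetype
    hits = defaultdict(int)    # of those, how many the model flagged
    for row in rows:
        if row["is_fraud"]:
            frauds[row["archetype"]] += 1
            if predict(row):
                hits[row["archetype"]] += 1
    return {name: hits[name] / count for name, count in frauds.items()}


# Toy rows and a toy amount-threshold rule, both invented for illustration.
toy_rows = [
    {"archetype": "gig_worker", "amount_usd": 500.0, "is_fraud": True},
    {"archetype": "gig_worker", "amount_usd": 20.0, "is_fraud": True},
    {"archetype": "remittance", "amount_usd": 900.0, "is_fraud": True},
    {"archetype": "remittance", "amount_usd": 15.0, "is_fraud": False},
]
toy_result = recall_by_archetype(toy_rows, lambda r: r["amount_usd"] > 100)
```

A gap in recall between archetypes is the kind of signal the benchmarking use case is after; a fuller evaluation would also track false positives per group.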

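Background and challenges

Background overview

In an era of rapid growth in the digital economy, the rise of the gig economy has brought flexibility and opportunity to the global labor market, along with new financial fraud risks. The gig-worker-fraud-narratives dataset was created in 2026 as part of Adaption Labs' Uncharted Data Challenge, and aims to fill the gap in existing financial fraud data for marginalized communities such as gig workers. Its core research question centers on simulating first-person narratives of gig workers who face account takeover, hacking, and social engineering scams, providing instruction-tuning data for training more inclusive fraud detection models. By generating synthetic data that includes multilingual narrative text and reasoning chains, the dataset advances work on fairness and accessibility in financial AI and is intended to improve how fraud detection systems generalize across diverse socioeconomic contexts.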

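Current challenges

The dataset addresses a long-standing representation bias in financial fraud detection: mainstream models are usually trained on middle-class credit card or e-commerce transaction data and struggle to recognize fraud patterns specific to marginalized communities such as gig workers, for example scams carried out through CashApp or peer-to-peer payment channels. The main construction challenge is balancing synthesis against realism. On one hand, diffusion models such as Tab-DDPM must learn the joint distribution of behavioral features from a limited set of public narratives in order to generate synthetic transactions that follow realistic statistical patterns. On the other hand, with no real transaction logs available for validation, behavioral profiles must be built from substitute sources such as consumer complaint archives, which complicates the trade-off between data fidelity and privacy protection. Generating culturally sensitive multilingual scam narratives, and making their emotional tone and details realistic, is a further key difficulty.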

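Common scenarios

Classic use cases

At the intersection of fintech and natural language processing, the dataset offers a distinctive narrative resource for studying fraud in the gig economy. Its 4,533 prompts guide the generation of first-person accounts that simulate gig workers facing account takeover, hacking, and social engineering fraud. The narratives describe the mechanics and financial impact of the fraud in detail and capture the victim's emotional response, providing a data foundation for training and evaluating how language models understand and generate text in this financial context.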
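
To make the prompt structure concrete, here is a small sketch of how a prompt could be assembled from the four details the dataset card names: fraud vector, financial instrument, transaction amount, and sender age. The actual prompt wording in the dataset is not shown on this page, so the template text, the `PromptSpec` class, and the `build_prompt` helper are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class PromptSpec:
    """Hypothetical container for the details a prompt specifies."""
    fraud_vector: str
    financial_instrument: str
    amount_usd: float
    sender_age: int


def build_prompt(spec: PromptSpec) -> str:
    """Assemble an illustrative prompt; the wording is an assumption."""
    return (
        "Write a first-person narrative from a gig economy worker "
        f"who lost ${spec.amount_usd:,.2f} through {spec.financial_instrument} "
        f"in a {spec.fraud_vector} scam. The sender is {spec.sender_age} years old. "
        "Cover the mechanics of the fraud, its financial impact, "
        "and the victim's emotional response."
    )


example_prompt = build_prompt(PromptSpec("SIM swap", "CashApp", 350.0, 29))
```

Keeping the four details in a structured record makes it straightforward to vary one of them at a time when generating or auditing prompts.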

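Derived and related work

Related work centers on fine-tuning financial language models and on fairness benchmarking. The dataset extends the Fraud Detection Framework, an Agentic RAG pipeline with a custom financial SLM built on the IEEE-CIS dataset, so that a model trained on mainstream data can be benchmarked against these populations. It also applies Tab-DDPM, a tabular diffusion model, to synthetic financial data and uses cross-lingual generation of fraud narratives, offering a methodological reference for follow-up research.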

Recent research on the dataset