Nachammai41/remittance-fraud-narratives_with_reasoning

Name: Nachammai41/remittance-fraud-narratives_with_reasoning
Creator: Nachammai41
Published: 2026-04-10 23:02:19
License: No description available

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Nachammai41/remittance-fraud-narratives_with_reasoning
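
For programmatic access, a minimal sketch using the Hugging Face `datasets` library might look like the following. The split name `train` is an assumption (this page does not list the available splits), the helper name `load_remittance_prompts` is hypothetical, and the `datasets` package has to be installed separately.

```python
# Repository id as shown in the page title.
REPO_ID = "Nachammai41/remittance-fraud-narratives_with_reasoning"


def load_remittance_prompts(split="train"):
    """Download one split of the dataset from the Hugging Face Hub.

    The default split name is an assumption; check the dataset viewer
    for the splits that actually exist.
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, split=split)


# Usage (requires network access):
#   ds = load_remittance_prompts()
#   print(ds)
```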


Resource overview:

---
annotations_creators: []
language:
- en
language_creators: []
license: []
multilinguality:
- monolingual
pretty_name: 'remittance_fraud_narratives'
size_categories:
- 1K<n<10K
source_datasets:
- 'original'
tags:
- adaption
- instruction-tuning
- writing-editing-communication
task_categories: []
task_ids: []
---

![banner](https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/c505608e-9c11-4d4d-930b-cfc3264de346.png)

This dataset is a remastered version prepared using [Adaption's](https://adaptionlabs.ai/app/auth) Adaptive Data platform.

# remittance_fraud_narratives

This dataset contains prompts designed to generate first-person narratives about financial transactions, specifically focusing on cross-border remittances within immigrant communities. Each entry specifies details such as the transaction archetype, fraud vector, financial instrument, amount, and sender demographics to guide the creation of realistic scam or legitimate transaction stories. The samples currently show null completions, indicating this is a prompt collection for generating synthetic data on financial fraud scenarios.

### Dataset size

There are 3,460 data points in this dataset. This is an instruction tuning dataset.

### Quality of Remastered Dataset

The final quality is A, with a relative quality improvement of 84.0%.
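
One reading of the 84.0% figure, offered as an assumption rather than a documented definition: the card later reports that the narrative quality score improved from E (5.0) to A (9.2–9.4). Taking the lower end of that range as the final score, the relative improvement works out as:

$$
\text{relative improvement} = \frac{9.2 - 5.0}{5.0} = 0.84 = 84\%
$$

By the same formula, the upper end of the range (9.4) would give 88%.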

### Domain

- Writing-editing-communication (100%)

### Language

- English (100%)

### Tone

- Anecdotal (100%)

### Evaluation Results

- **Quality Gains:**
  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/32e1d029-d564-4f5d-abfb-531d6433874c.png" alt="QualityGains" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Grade Improvement:**
  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/d76000ee-61b0-4452-b26c-87582a582150.png" alt="Grade" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Percentile Chart:**
  <img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/561b4784-9814-4e9b-9517-fa74615bf23d.png" alt="Percentile Chart" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />

# Underserved Financial Fraud Dataset

### Synthetic fraud detection data for underrepresented communities

**Created with Adaptive Data by Adaption** | CC BY 4.0 | 5 languages

---

## What This Is

A synthetic financial fraud dataset covering **four underserved community archetypes**: populations that rely on remittance transfers, gig economy payouts, prepaid cards, and ITIN-based transactions. These communities are disproportionately targeted by fraud, yet no open-source fraud dataset has ever modeled their financial behavior. This dataset fills that gap.

---

## The Four Archetypes

| Archetype | Who | Fraud Vectors | Languages |
|---|---|---|---|
| **Remittance Sender** | Immigrants sending money cross-border via Western Union, Remitly, MoneyGram | Emergency call scams, fake exchange rate bonuses, interception | es, ht, yo, hi, en |
| **Gig Worker** | Uber, DoorDash, Instacart workers paid via CashApp, Venmo | Account takeover, SIM swap, fake platform support calls | en, hi, vi, es, yo |
| **Unbanked Cash-In User** | Populations using prepaid cards and retail kiosks | Predatory micro-loans, load-fee scams, fake utility kiosks | en, es, vi, yo, hi |
| **ITIN Entrepreneur** | Immigrant small business owners with no SSN | Synthetic identity fraud, fake tax returns, mule accounts | en, es, hi, ta, vi |

---

## Languages

`en` English  |  `es` Spanish  |  `hi` Hinglish  |  `ht` Haitian Creole  |  `yo` Yoruba  |  `vi` Vietnamese  |  `ta` Tamil  |  `ta-en` Tamil-English

---

## What Makes It Different

**No existing fraud dataset covers this population.** PaySim simulates generic mobile money. Sparkov models middle-class credit cards. IEEE-CIS captures e-commerce. None model remittance kiosks, gig payouts, or ITIN-linked accounts.

**Generated with diffusion, not rules.** Tabular data generated using Tab-DDPM (denoising diffusion for tabular data), which learns joint correlations across behavioral features, not just independent column sampling. Trained on A100 GPU via Google Colab Pro.

**Multilingual narrative text.** Every fraud transaction has a `narrative_text` field: the scam message or fraud description in the community's language. Generated by Adaptive Data by Adaption. Quality score improved from E (5.0) to A (9.2–9.4).

**Reasoning traces.** 390 chain-of-thought fraud analysis examples: step-by-step investigator reasoning grounded in community-specific fraud signals. No existing fraud dataset includes this. Built for fine-tuning financial language models (FinBERT, Gemma).

---

## How It Was Built

```
1. Scrape     1,040 real fraud narratives from CFPB, BBB Scam Tracker,
              and Reddit archive (Pullpush.io)
2. Profile    Behavioral distributions per archetype derived from scraped
              narratives: amounts, channels, corridors, fraud vectors,
              language mix
3. Generate   Tab-DDPM trains on 5,000 seed rows per archetype, learns joint
              feature correlations, generates 5,000 synthetic transactions
              per archetype
4. Narrate    Adaptive Data by Adaption fills narrative_text in 8 languages
              per transaction's fraud context
5. Trace      390 reasoning traces generated: chain-of-thought fraud
              analysis for fine-tuning use
```

---

## Schema (Key Fields)

| Field | Type | Description |
|---|---|---|
| `transaction_id` | uuid | Unique identifier |
| `archetype` | categorical | remittance / gig_worker / unbanked / itin |
| `amount_usd` | float | Transaction amount |
| `channel` | categorical | retail_kiosk / mobile_app / p2p / bank_wire |
| `fraud_vector` | categorical | Specific scam type |
| `is_fraud` | bool | Ground truth label |
| `fraud_confidence` | float | 0.0–1.0 label confidence |
| `narrative_text` | string | Scam description in community language |
| `narrative_language` | categorical | ISO 639-1 language code |
| `reasoning_trace` | string | Chain-of-thought fraud analysis (sampled rows) |

## Intended Use

- Training fraud detection models on underserved community transaction patterns
- Benchmarking existing models (IEEE-CIS trained) against this population
- Fine-tuning financial language models on multilingual fraud narratives
- Research into financial inclusion

---

## What This Is Not

This is a **fully synthetic** dataset. No real transaction data. No PII. Behavioral distributions are informed by public fraud narratives and World Bank remittance corridor data, not empirically measured transaction logs. Like all synthetic fraud datasets (PaySim, Sparkov, Cifer-AF), ground truth validation against real data is not possible due to privacy constraints.
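
The key fields in the schema table can be sketched as a typed record with a few sanity checks. This is an illustrative sketch only: the field names and category values are taken from the table, while the `Transaction` class, the `validate` method, and the non-negative amount rule are assumptions and not part of the published dataset tooling.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional

# Category values copied from the schema table.
ARCHETYPES = {"remittance", "gig_worker", "unbanked", "itin"}
CHANNELS = {"retail_kiosk", "mobile_app", "p2p", "bank_wire"}


@dataclass
class Transaction:
    transaction_id: str
    archetype: str
    amount_usd: float
    channel: str
    fraud_vector: str
    is_fraud: bool
    fraud_confidence: float
    narrative_text: str
    narrative_language: str
    reasoning_trace: Optional[str] = None  # only present on sampled rows

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the row passed."""
        problems = []
        try:
            uuid.UUID(self.transaction_id)
        except ValueError:
            problems.append("transaction_id is not a valid UUID")
        if self.archetype not in ARCHETYPES:
            problems.append(f"unknown archetype: {self.archetype}")
        if self.channel not in CHANNELS:
            problems.append(f"unknown channel: {self.channel}")
        if not 0.0 <= self.fraud_confidence <= 1.0:
            problems.append("fraud_confidence outside 0.0-1.0")
        if self.amount_usd < 0:  # assumption: amounts are non-negative
            problems.append("amount_usd is negative")
        return problems


# Hypothetical example row; the values are invented for illustration.
row = Transaction(
    transaction_id=str(uuid.uuid4()),
    archetype="remittance",
    amount_usd=450.0,
    channel="retail_kiosk",
    fraud_vector="emergency_call_scam",
    is_fraud=True,
    fraud_confidence=0.9,
    narrative_text="...",
    narrative_language="es",
)
print(row.validate())
```

Returning a list of problems rather than raising on the first one makes it easier to audit a whole file and count each kind of violation.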

---

## Origin

This dataset was created as part of the **Uncharted Data Challenge** by Adaption Labs (April 2026). It extends the [Fraud Detection Framework](https://github.com/nachammai779/Fraud-Detection-Framework---An-Agentic-RAG-Pipeline-with-Custom-Financial-SLM), an Agentic RAG pipeline with a custom Financial SLM built on the IEEE-CIS dataset (AUC-ROC 0.9486). The underserved dataset enables direct benchmarking: how does a model trained on mainstream data perform on populations it has never seen?

---

## Citation

```bibtex
@dataset{palaniappan2026underserved,
  author    = {Palaniappan, Nachammai},
  title     = {Underserved Financial Fraud Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  note      = {Created with Adaptive Data by Adaption. Uncharted Data Challenge, Adaption Labs.},
  url       = {https://huggingface.co/datasets/nachammai779/underserved-financial-fraud}
}
```

---

## Credits

- **Adaptive Data by Adaption**: Narrative generation and dataset enrichment
- **Tab-DDPM** (Kotelnikov et al., 2022): Tabular diffusion model
- **CFPB**: Consumer Financial Protection Bureau public complaint database
- **BBB Scam Tracker**: Better Business Bureau public scam reports
- **Pullpush.io**: Reddit archive API

---

*License: CC BY 4.0. Free to use with attribution.*

Provider:

Nachammai41

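Collected summary

Dataset introduction

Construction method

Data on underserved communities has long been scarce in financial fraud detection. This dataset was built with a systematic synthesis pipeline that starts from 1,040 real fraud narratives scraped from the Consumer Financial Protection Bureau (CFPB), the Better Business Bureau (BBB) Scam Tracker, and a Reddit archive. From these narratives, the creator derived transaction behavior distributions for four community archetypes: remittance senders, gig economy workers, unbanked cash-in users, and ITIN entrepreneurs. A tabular denoising diffusion model (Tab-DDPM) then learned the joint correlations among features and generated 5,000 synthetic transactions per archetype. Finally, the Adaptive Data platform produced multilingual fraud narrative text for the transactions, and 390 chain-of-thought fraud analysis examples were added as material for model fine-tuning.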
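
The profiling step (deriving per-archetype distributions from the scraped narratives) can be illustrated with a simple frequency estimate. This is a sketch under stated assumptions: the page does not say how the narratives were labeled or profiled, so the record layout, the sample rows, and the function name `profile_fraud_vectors` are all hypothetical, and this is not the Tab-DDPM generation step itself.

```python
from collections import Counter, defaultdict


def profile_fraud_vectors(narratives):
    """Estimate P(fraud_vector | archetype) from labeled narrative records."""
    counts = defaultdict(Counter)
    for row in narratives:
        counts[row["archetype"]][row["fraud_vector"]] += 1

    profile = {}
    for archetype, counter in counts.items():
        total = sum(counter.values())
        # Normalize raw counts into relative frequencies per archetype.
        profile[archetype] = {vector: n / total for vector, n in counter.items()}
    return profile


# Hypothetical labeled narratives; labels are invented for illustration.
sample = [
    {"archetype": "remittance", "fraud_vector": "emergency_call_scam"},
    {"archetype": "remittance", "fraud_vector": "emergency_call_scam"},
    {"archetype": "remittance", "fraud_vector": "fake_exchange_rate_bonus"},
    {"archetype": "gig_worker", "fraud_vector": "sim_swap"},
]
profile = profile_fraud_vectors(sample)
print(profile)
```

Distributions like these could then serve as the basis for seed rows, whereas a diffusion model is meant to capture the joint structure across several features that per-column frequencies miss.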

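Features

The dataset's defining feature is its focus on underserved communities that conventional fraud detection models overlook. Rather than simulating generic transactions, it models fraud patterns in specific settings: cross-border remittances, gig economy payouts, prepaid card use, and ITIN-based transactions. Every fraud transaction carries narrative text describing the scam in a language the community commonly uses, with eight language codes covering English, Spanish, Haitian Creole, and others, which improves realism and cultural fit. The dataset also includes chain-of-thought reasoning traces that walk through the fraud analysis step by step. According to the dataset card, no existing fraud dataset includes such traces, which makes them a useful resource for training interpretable financial language models.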
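
The language coverage can be checked directly from the dataset card's archetype table and language list. The sketch below copies those values into two dictionaries (the variable names are hypothetical) and shows how "5 languages" per archetype and eight listed codes fit together.

```python
# Codes and names as listed in the card's "Languages" section.
LANGUAGE_NAMES = {
    "en": "English",
    "es": "Spanish",
    "hi": "Hinglish",
    "ht": "Haitian Creole",
    "yo": "Yoruba",
    "vi": "Vietnamese",
    "ta": "Tamil",
    "ta-en": "Tamil-English",
}

# Languages per archetype, copied from "The Four Archetypes" table.
ARCHETYPE_LANGUAGES = {
    "Remittance Sender": ["es", "ht", "yo", "hi", "en"],
    "Gig Worker": ["en", "hi", "vi", "es", "yo"],
    "Unbanked Cash-In User": ["en", "es", "vi", "yo", "hi"],
    "ITIN Entrepreneur": ["en", "es", "hi", "ta", "vi"],
}

# Union of every code that appears in at least one archetype row.
used_codes = set()
for codes in ARCHETYPE_LANGUAGES.values():
    used_codes.update(codes)

# Codes that are listed but not attached to any archetype in the table.
unassigned_codes = set(LANGUAGE_NAMES) - used_codes

for archetype, codes in ARCHETYPE_LANGUAGES.items():
    print(archetype, [LANGUAGE_NAMES[c] for c in codes])
print(sorted(used_codes), sorted(unassigned_codes))
```

Each archetype lists five languages, the table uses seven distinct codes in total, and `ta-en` appears only in the language list.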

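Usage

The dataset mainly serves research and applications at the intersection of fintech and artificial intelligence. Researchers can use it to train dedicated fraud detection models that capture the transaction patterns and fraud signals specific to underserved communities, addressing the weak performance of existing models on these populations. It also provides a benchmark for evaluating how well mainstream fraud detection models generalize to groups they have not seen. For natural language processing, the multilingual narrative text and reasoning traces are well suited to fine-tuning pretrained language models in the financial domain, improving their ability to understand and generate complex fraud scenarios. The dataset can also support research on financial inclusion and help build fairer, more representative digital financial services.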
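
The benchmarking use case can be sketched by scoring a model on a mainstream test set and on the underserved data, then comparing AUC-ROC. This is an illustrative sketch only: the pairwise AUC-ROC implementation is standard, but the labels and scores are invented toy values and do not come from any real model or from this dataset.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the probability that a random fraud row outscores a random
    legitimate row, counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores for illustration only (1 = fraud, 0 = legitimate).
mainstream_auc = roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
underserved_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
gap = mainstream_auc - underserved_auc
print(mainstream_auc, underserved_auc, gap)
```

The pairwise loop is quadratic in the number of rows, which is fine for a sketch; a rank-based implementation or an established metrics library would be the usual choice for full-size evaluations.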

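Background and challenges

Background overview

At the intersection of fintech and artificial intelligence, fraud detection research has long been limited by data scarcity, especially for marginalized communities involved in cross-border remittances and the gig economy. In 2026, Nachammai Palaniappan built the 'remittance-fraud-narratives_with_reasoning' dataset as part of the 'Uncharted Data Challenge' run by Adaption Labs. The dataset uses synthetic data generation to model fraudulent and legitimate transaction narratives from immigrant communities in cross-border remittance settings. Its core research question is how to improve the generalization and fairness of financial fraud detection models for underserved groups. Its contribution is to combine multilingual narrative text with chain-of-thought reasoning traces, offering a new benchmark resource for financial natural language processing and fraud analysis and supporting research on financial inclusion.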

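Current challenges

The dataset targets a gap in financial fraud detection: transaction patterns in marginalized communities are poorly recognized because conventional models lack representative data and perform badly on these groups. The main challenges in building it were:

1. Data scarcity. Real transaction data is hard to obtain because of privacy constraints, so generation has to rely on public fraud narratives and derived behavioral distributions.
2. Quality control for multilingual narrative generation. Fraud descriptions must remain accurate and culturally appropriate across different language backgrounds.
3. Validating the realism of synthetic data. The key technical difficulty is keeping the transaction features produced by the diffusion model statistically consistent with real-world scenarios while avoiding the introduction of bias.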
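
One common way to quantify statistical consistency between a synthetic feature and a reference sample is the two-sample Kolmogorov-Smirnov statistic. The page does not say which checks, if any, were applied to this dataset, so the following is a sketch of a general technique, with hypothetical amounts.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")

    i = j = 0
    max_gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        max_gap = max(max_gap, abs(i / len(a) - j / len(b)))
    return max_gap


# Hypothetical transaction amounts in USD, invented for illustration.
reference_amounts = [100.0, 200.0, 300.0, 400.0]
synthetic_amounts = [300.0, 400.0, 500.0, 600.0]
print(ks_statistic(reference_amounts, synthetic_amounts))
```

A per-feature statistic like this only compares marginal distributions; it says nothing about whether joint correlations between features are preserved.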

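Common scenarios

Classic use cases

In financial fraud detection, the dataset provides a range of simulated scenarios for studying cross-border remittance fraud. By prompting for first-person narratives set in immigrant communities, it covers multiple fraud vectors, such as emergency call scams and fake exchange rate bonuses, and offers a highly structured collection of prompts for model training. Each prompt specifies the transaction type, financial instrument, and demographic details. Because the data is synthetic, it can represent the financial behavior of vulnerable groups that conventional datasets leave out, which makes it a useful resource for instruction tuning and behavioral pattern analysis.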
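
A structured prompt of the kind described here could be assembled from a specification dictionary. The template wording, the field names, and the `build_prompt` helper are all hypothetical; the actual prompt format used in the dataset is not shown on this page.

```python
REQUIRED_FIELDS = ("archetype", "fraud_vector", "instrument", "amount_usd", "sender")

# Hypothetical template; the dataset's real prompt wording may differ.
PROMPT_TEMPLATE = (
    "Write a first-person narrative about a cross-border remittance.\n"
    "Archetype: {archetype}\n"
    "Fraud vector: {fraud_vector}\n"
    "Financial instrument: {instrument}\n"
    "Amount (USD): {amount_usd:.2f}\n"
    "Sender: {sender}\n"
)


def build_prompt(spec):
    """Fill the template from a spec dict, failing loudly on missing fields."""
    missing = [name for name in REQUIRED_FIELDS if name not in spec]
    if missing:
        raise KeyError(f"missing prompt fields: {missing}")
    return PROMPT_TEMPLATE.format(**spec)


# Invented example specification.
prompt = build_prompt(
    {
        "archetype": "remittance",
        "fraud_vector": "emergency_call_scam",
        "instrument": "retail kiosk money transfer",
        "amount_usd": 450,
        "sender": "adult sending money to family abroad",
    }
)
print(prompt)
```

Keeping the specification separate from the template makes it straightforward to generate matched fraud and legitimate variants by changing only the `fraud_vector` field.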

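Academic problems addressed

The dataset mainly addresses the scarcity of data on vulnerable groups in financial fraud detection research. Conventional fraud detection models are usually trained on mainstream transaction data and struggle to recognize fraud patterns specific to communities such as immigrants and gig workers. By providing multilingual narrative text and reasoning traces, the dataset supports the study of how fraud signals differ across cultures, contributes to financial inclusion research, and provides a data foundation for developing detection algorithms that generalize better.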

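Derived and related work

Work related to this dataset includes the application of Tab-DDPM-based synthetic data generation to the financial domain and a benchmarking setup for fraud detection on vulnerable groups. For example, within the 'Uncharted Data Challenge' it extends an Agentic RAG pipeline built on the IEEE-CIS dataset and enables performance comparisons against that mainstream data. This work connects synthetic data generation with financial language models such as FinBERT and Gemma, and offers chain-of-thought reasoning as an approach to fraud analysis.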

The content above was collected and summarized by 遇见数据集.
