Nachammai41/itin_fraud_narratives_with_reasoning

Name: Nachammai41/itin_fraud_narratives_with_reasoning
Creator: Nachammai41
Published: 2026-04-10 23:13:37
License: No description available

Hugging Face | Updated 2026-04-10 | Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/Nachammai41/itin_fraud_narratives_with_reasoning


Resource overview:

---
annotations_creators: []
language:
- en
language_creators: []
license: []
multilinguality:
- monolingual
pretty_name: 'itin_fraud_narratives'
size_categories:
- n<1K
source_datasets:
- 'original'
tags:
- adaption
- instruction-tuning
- writing-editing-communication
- legal
task_categories: []
task_ids: []
---

![banner](https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/51a38fce-27f7-4eb2-a4f0-8be17cb21459.png)

This dataset is a remastered version prepared using [Adaption's](https://adaptionlabs.ai/app/auth) Adaptive Data platform.

# itin_fraud_narratives

This dataset contains prompts designed to generate first-person narratives about financial fraud targeting ITIN holders, including identity theft and synthetic identity schemes. Each entry specifies details such as the fraud vector, financial instrument, transaction amount, and community context to guide the generation of realistic scam scenarios. The completions are currently empty, indicating this is a prompt-only collection for training or evaluation purposes.

### Dataset size

There are 93 data points in this dataset. This is an instruction tuning dataset.

### Quality of Remastered Dataset

The final quality is A, with a relative quality improvement of 88.0%.

### Domain

- Writing-editing-communication (62%)
- Legal (6%)

### Language

- English (100%)

### Tone

- Anecdotal (90%)

### Evaluation Results

- **Quality Gains:**

<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/df6c9bd7-fa34-42ba-8934-7a17484e81c6.png" alt="QualityGains" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />

- **Grade Improvement:**

<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/902db584-cbf1-486c-b424-a41d6e028017.png" alt="Grade" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />

- **Percentile Chart:**

<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/5a5c2ac5-7fe9-4f27-bc0e-92647209ad45.png" alt="Percentile Chart" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />

# Underserved Financial Fraud Dataset

### Synthetic fraud detection data for underrepresented communities

**Created with Adaptive Data by Adaption** | CC BY 4.0 | 20,000 records | 8 languages

---

## What This Is

A synthetic financial fraud dataset covering **four underserved community archetypes**: populations that rely on remittance transfers, gig economy payouts, prepaid cards, and ITIN-based transactions. These communities are disproportionately targeted by fraud, yet no open-source fraud dataset has ever modeled their financial behavior. This dataset fills that gap.

---

## The Four Archetypes

| Archetype | Who | Fraud Vectors | Languages |
|---|---|---|---|
| **Remittance Sender** | Immigrants sending money cross-border via Western Union, Remitly, MoneyGram | Emergency call scams, fake exchange rate bonuses, interception | es, ht, yo, hi, en |
| **Gig Worker** | Uber, DoorDash, Instacart workers paid via CashApp, Venmo | Account takeover, SIM swap, fake platform support calls | en, hi, vi, es, yo |
| **Unbanked Cash-In User** | Populations using prepaid cards and retail kiosks | Predatory micro-loans, load-fee scams, fake utility kiosks | en, es, vi, yo, hi |
| **ITIN Entrepreneur** | Immigrant small business owners with no SSN | Synthetic identity fraud, fake tax returns, mule accounts | en, es, hi, ta, vi |

---

## Languages

`en` English | `es` Spanish | `hi` Hinglish | `ht` Haitian Creole | `yo` Yoruba | `vi` Vietnamese | `ta` Tamil | `ta-en` Tamil-English

---

## What Makes It Different

**No existing fraud dataset covers this population.** PaySim simulates generic mobile money. Sparkov models middle-class credit cards. IEEE-CIS captures e-commerce. None model remittance kiosks, gig payouts, or ITIN-linked accounts.

**Generated with diffusion, not rules.** Tabular data generated using Tab-DDPM (denoising diffusion for tabular data), which learns joint correlations across behavioral features, not just independent column sampling. Trained on A100 GPU via Google Colab Pro.

**Multilingual narrative text.** Every fraud transaction has a `narrative_text` field: the scam message or fraud description in the community's language. Generated by Adaptive Data by Adaption. Quality score improved from E (5.0) to A (9.2–9.4).

**Reasoning traces.** 390 chain-of-thought fraud analysis examples: step-by-step investigator reasoning grounded in community-specific fraud signals. No existing fraud dataset includes this. Built for fine-tuning financial language models (FinBERT, Gemma).

---

## How It Was Built

```
1. Scrape    1,040 real fraud narratives from CFPB, BBB Scam Tracker,
             and Reddit archive (Pullpush.io)

2. Profile   Behavioral distributions per archetype derived from
             scraped narratives: amounts, channels, corridors,
             fraud vectors, language mix

3. Generate  Tab-DDPM trains on 5,000 seed rows per archetype,
             learns joint feature correlations, generates
             5,000 synthetic transactions per archetype

4. Narrate   Adaptive Data by Adaption fills narrative_text
             in 8 languages per transaction's fraud context

5. Trace     390 reasoning traces generated: chain-of-thought
             fraud analysis for fine-tuning use
```

---

## Schema (Key Fields)

| Field | Type | Description |
|---|---|---|
| `transaction_id` | uuid | Unique identifier |
| `archetype` | categorical | remittance / gig_worker / unbanked / itin |
| `amount_usd` | float | Transaction amount |
| `channel` | categorical | retail_kiosk / mobile_app / p2p / bank_wire |
| `fraud_vector` | categorical | Specific scam type |
| `is_fraud` | bool | Ground truth label |
| `fraud_confidence` | float | 0.0–1.0 label confidence |
| `narrative_text` | string | Scam description in community language |
| `narrative_language` | categorical | ISO 639-1 language code |
| `reasoning_trace` | string | Chain-of-thought fraud analysis (sampled rows) |

## Intended Use

- Training fraud detection models on underserved community transaction patterns
- Benchmarking existing models (IEEE-CIS trained) against this population
- Fine-tuning financial language models on multilingual fraud narratives
- Research into AI fairness and financial inclusion
- NLP research on under-resourced financial language

---

## What This Is Not

This is a **fully synthetic** dataset. No real transaction data. No PII. Behavioral distributions are informed by public fraud narratives and World Bank remittance corridor data, not empirically measured transaction logs.
Like all synthetic fraud datasets (PaySim, Sparkov, Cifer-AF), ground truth validation against real data is not possible due to privacy constraints.

---

## Origin

This dataset was created as part of the **Uncharted Data Challenge** by Adaption Labs (April 2026). It extends the [Fraud Detection Framework](https://github.com/nachammai779/Fraud-Detection-Framework---An-Agentic-RAG-Pipeline-with-Custom-Financial-SLM), an Agentic RAG pipeline with a custom Financial SLM built on the IEEE-CIS dataset (AUC-ROC 0.9486). The underserved dataset enables direct benchmarking: how does a model trained on mainstream data perform on populations it has never seen?

---

## Citation

```bibtex
@dataset{palaniappan2026underserved,
  author    = {Palaniappan, Nachammai},
  title     = {Underserved Financial Fraud Dataset},
  year      = {2026},
  publisher = {HuggingFace},
  note      = {Created with Adaptive Data by Adaption. Uncharted Data Challenge, Adaption Labs.},
  url       = {https://huggingface.co/datasets/nachammai779/underserved-financial-fraud}
}
```

---

## Credits

- **Adaptive Data by Adaption**: Narrative generation and dataset enrichment
- **Tab-DDPM** (Kotelnikov et al., 2022): Tabular diffusion model
- **CFPB**: Consumer Financial Protection Bureau public complaint database
- **BBB Scam Tracker**: Better Business Bureau public scam reports
- **Pullpush.io**: Reddit archive API

---

*License: CC BY 4.0. Free to use with attribution.*
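As a rough illustration of the key schema fields listed earlier in the card, the sketch below models one record as a Python dataclass and checks the documented categories and the 0.0–1.0 confidence range. The `FraudRecord` class, the `validate` helper, and the sample values are assumptions made for this example; they are not part of any published tooling for the dataset.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional

# Category values copied from the schema table.
ARCHETYPES = {"remittance", "gig_worker", "unbanked", "itin"}
CHANNELS = {"retail_kiosk", "mobile_app", "p2p", "bank_wire"}


@dataclass
class FraudRecord:
    """One row mirroring the key schema fields (hypothetical class)."""
    transaction_id: str
    archetype: str
    amount_usd: float
    channel: str
    fraud_vector: str
    is_fraud: bool
    fraud_confidence: float
    narrative_text: str
    narrative_language: str
    reasoning_trace: Optional[str] = None  # only present on sampled rows


def validate(record: FraudRecord) -> List[str]:
    """Return a list of problems; an empty list means the row passes these checks."""
    problems = []
    try:
        uuid.UUID(record.transaction_id)
    except ValueError:
        problems.append("transaction_id is not a valid UUID")
    if record.archetype not in ARCHETYPES:
        problems.append(f"unknown archetype: {record.archetype}")
    if record.channel not in CHANNELS:
        problems.append(f"unknown channel: {record.channel}")
    if not 0.0 <= record.fraud_confidence <= 1.0:
        problems.append("fraud_confidence outside 0.0-1.0")
    return problems


# Made-up sample values, for illustration only.
sample = FraudRecord(
    transaction_id=str(uuid.uuid4()),
    archetype="itin",
    amount_usd=1250.0,
    channel="bank_wire",
    fraud_vector="synthetic_identity",
    is_fraud=True,
    fraud_confidence=0.92,
    narrative_text="Example narrative text.",
    narrative_language="en",
)
print(validate(sample))
```

A check like this is cheap to run on every row before training and catches category typos early.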

Provided by:

Nachammai41

Collected Summary

Dataset Introduction

Construction Method

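In financial fraud detection, where data on minority groups is scarce, this dataset was built through a systematic pipeline. First, 1,040 real fraud narratives were collected from public sources such as the Consumer Financial Protection Bureau (CFPB) and the Better Business Bureau (BBB) as source material. Behavioral distributions for each group, covering transaction amounts, channels, and fraud patterns, were then derived from these narratives. Tab-DDPM, a tabular diffusion model, learned the joint correlations among features and generated 5,000 synthetic transaction records for each group. Finally, Adaption's Adaptive Data platform generated multilingual fraud narrative text for the records, and 390 fraud analysis examples with step-by-step reasoning chains were added, yielding a structurally complete instruction-tuning dataset.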
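The profiling step described above, deriving per-group behavioral distributions before any synthetic generation, might look roughly like the following sketch. The `profile_by_archetype` function and the four toy rows are hypothetical. Only the field names `archetype`, `amount_usd`, and `channel` come from the dataset's schema, and the actual profiling code is not published in the card.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def profile_by_archetype(rows):
    """Summarize amount statistics and channel shares for each archetype."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["archetype"]].append(row)

    profiles = {}
    for archetype, items in grouped.items():
        amounts = [r["amount_usd"] for r in items]
        channel_counts = Counter(r["channel"] for r in items)
        total = len(items)
        profiles[archetype] = {
            "count": total,
            "mean_amount": mean(amounts),
            "median_amount": median(amounts),
            # Fraction of this archetype's rows seen on each channel.
            "channel_share": {c: n / total for c, n in channel_counts.items()},
        }
    return profiles


# Toy seed rows, invented for illustration.
seed_rows = [
    {"archetype": "remittance", "amount_usd": 300.0, "channel": "retail_kiosk"},
    {"archetype": "remittance", "amount_usd": 500.0, "channel": "mobile_app"},
    {"archetype": "itin", "amount_usd": 1200.0, "channel": "bank_wire"},
    {"archetype": "itin", "amount_usd": 800.0, "channel": "bank_wire"},
]
profiles = profile_by_archetype(seed_rows)
print(profiles["remittance"])
```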

Features

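The dataset's core feature is its focus on marginalized groups in financial services. It covers four groups that rely on remittances, gig-economy income, prepaid cards, and ITIN-based transactions, groups that traditional fraud datasets tend to overlook. The synthetic data is generated with a diffusion model, which captures correlations between features rather than sampling each column independently. Every fraud transaction carries narrative text written in the community's language, which adds realism and cultural relevance. Notably, the dataset also includes chain-of-thought reasoning traces that give models a logical framework for fraud analysis, which is rare among existing financial datasets.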
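To see why learning joint correlations matters compared with independent column sampling, the toy sketch below builds two correlated columns and then resamples each column on its own. This is not Tab-DDPM; it only illustrates what is lost when columns are sampled independently. The `fee_usd` column, the linear relationship, and the noise level are all invented for the example.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy "real" data: the fee grows with the amount, plus some noise.
amount_usd = [rng.uniform(50, 2000) for _ in range(2000)]
fee_usd = [0.03 * a + rng.gauss(0, 5) for a in amount_usd]
joint_corr = pearson(amount_usd, fee_usd)

# Independent column sampling: each column is drawn from its own marginal,
# so each column still looks realistic but the pairing is destroyed.
indep_amount = rng.choices(amount_usd, k=2000)
indep_fee = rng.choices(fee_usd, k=2000)
indep_corr = pearson(indep_amount, indep_fee)

print(f"joint: {joint_corr:.3f}, independent: {indep_corr:.3f}")
```

A generator that models the joint distribution should reproduce something close to the first correlation, while column-by-column sampling drives it toward zero.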

Usage

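The dataset is mainly intended for research and practice in fintech and natural language processing. Researchers can use it to train fraud detection models, particularly on transaction patterns of marginalized groups that traditional data does not cover well, and to evaluate and improve model fairness and generalization. The multilingual narrative text is suitable for fine-tuning pretrained language models for finance, such as FinBERT or Gemma, to strengthen their grasp of specific financial contexts and languages. The included reasoning traces are also a useful resource for building explainable fraud analysis systems and support chain-of-thought fine-tuning.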
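One possible way to prepare rows for chain-of-thought fine-tuning is sketched below: rows that carry a `reasoning_trace` are turned into prompt/completion pairs and serialized as JSON Lines. The prompt wording, the `Verdict:` suffix, and the helper names are assumptions for illustration, since the card does not prescribe a training format.

```python
import json


def to_training_example(row):
    """Turn one row into a prompt/completion pair (hypothetical format)."""
    prompt = (
        "Analyze the following transaction narrative and decide whether it is fraud.\n"
        f"Archetype: {row['archetype']}\n"
        f"Language: {row['narrative_language']}\n"
        f"Narrative: {row['narrative_text']}\n"
    )
    verdict = "fraud" if row["is_fraud"] else "not fraud"
    completion = f"{row['reasoning_trace']}\nVerdict: {verdict}"
    return {"prompt": prompt, "completion": completion}


def build_jsonl(rows):
    """Keep only rows with a reasoning trace; emit one JSON object per line."""
    examples = [to_training_example(r) for r in rows if r.get("reasoning_trace")]
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)


# Invented rows: the second has no trace and is skipped.
rows = [
    {
        "archetype": "itin",
        "narrative_language": "en",
        "narrative_text": "A caller claimed my ITIN had expired and asked for a wire.",
        "is_fraud": True,
        "reasoning_trace": "Step 1: urgency cue. Step 2: unsolicited payment request.",
    },
    {
        "archetype": "gig_worker",
        "narrative_language": "en",
        "narrative_text": "Weekly payout arrived as usual.",
        "is_fraud": False,
        "reasoning_trace": None,
    },
]
jsonl = build_jsonl(rows)
print(jsonl)
```

Filtering on the trace reflects the schema note that `reasoning_trace` is only filled for sampled rows.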

Background and Challenges

Background Overview

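At the intersection of fintech and artificial intelligence, data for detecting financial fraud against minority groups has long been scarce. In 2026, Nachammai Palaniappan built the itin_fraud_narratives_with_reasoning dataset as part of the Uncharted Data Challenge by Adaption Labs. The dataset focuses on immigrants who hold an Individual Taxpayer Identification Number (ITIN) and models financial scams that target this group, such as identity theft and synthetic identity fraud. Its central aim is to fill the data gap that mainstream fraud detection models have on underserved communities. By generating synthetic data with multilingual narrative text and reasoning chains, it is intended to improve how AI models perform on financial inclusion and fairness, and to provide a data foundation for algorithmic fairness research.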

Current Challenges

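The dataset addresses the challenge of modeling underserved communities in financial fraud detection. Traditional datasets such as PaySim or IEEE-CIS mainly cover mainstream transaction patterns and do not capture fraud signals tied to specific financial behaviors such as remittances or gig-economy payouts. During construction, the main difficulty was balancing synthesis against realism. Because privacy constraints rule out real transaction logs, behavioral distributions had to be inferred from public fraud narratives and World Bank data. Tab-DDPM was used to generate synthetic tabular data that preserves joint feature correlations, and the quality of the multilingual narrative text had to be raised as well: its quality score improved from an initial E to A.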
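To check whether a model trained on mainstream data degrades on a particular group, one option is to compute AUC-ROC separately for each archetype. The sketch below does this with the rank-sum formulation and no external dependencies. The `score` field stands in for a hypothetical model's output, and the sample rows are invented.

```python
from collections import defaultdict


def auc_roc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUC is undefined when only one class is present
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_by_archetype(rows):
    """Group rows by archetype and compute AUC-ROC within each group."""
    grouped = defaultdict(lambda: ([], []))
    for row in rows:
        labels, scores = grouped[row["archetype"]]
        labels.append(row["is_fraud"])
        scores.append(row["score"])
    return {a: auc_roc(l, s) for a, (l, s) in grouped.items()}


# Invented scores from a hypothetical model.
scored_rows = [
    {"archetype": "itin", "is_fraud": True, "score": 0.9},
    {"archetype": "itin", "is_fraud": False, "score": 0.2},
    {"archetype": "itin", "is_fraud": True, "score": 0.6},
    {"archetype": "itin", "is_fraud": False, "score": 0.7},
    {"archetype": "gig_worker", "is_fraud": True, "score": 0.8},
    {"archetype": "gig_worker", "is_fraud": False, "score": 0.1},
]
print(auc_by_archetype(scored_rows))
```

Reporting the metric per group rather than in aggregate keeps a weak result on one archetype from being averaged away by the others.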

Common Scenarios

Classic Use Cases

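In financial fraud detection, the itin_fraud_narratives dataset serves as an instruction-tuning resource whose classic use is generating fraud narratives about ITIN holders. Its prompts specify the fraud vector, financial instrument, transaction amount, and community context, which guides a model toward realistic scam scenarios for training and evaluating natural language generation. This strengthens a model's grasp of specific fraud patterns and provides a structured framework for simulating complex financial crime scenarios.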
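A prompt-only entry of the kind described above could be assembled as in the sketch below. The card does not show the actual prompt template, so the `NarrativePrompt` class, its field names, the prompt wording, and the sample values are all assumptions. The empty `completion` mirrors the card's statement that completions are currently empty.

```python
from dataclasses import dataclass


@dataclass
class NarrativePrompt:
    """The four details each entry is said to specify (hypothetical field names)."""
    fraud_vector: str
    financial_instrument: str
    amount_usd: float
    community_context: str

    def render(self) -> str:
        """Build the instruction text from the structured details."""
        return (
            "Write a first-person narrative about a financial fraud "
            "targeting an ITIN holder.\n"
            f"Fraud vector: {self.fraud_vector}\n"
            f"Financial instrument: {self.financial_instrument}\n"
            f"Transaction amount: ${self.amount_usd:,.2f}\n"
            f"Community context: {self.community_context}\n"
        )


def to_entry(prompt: NarrativePrompt) -> dict:
    """Prompt-only record: the completion is left empty."""
    return {"prompt": prompt.render(), "completion": ""}


# Invented example values.
entry = to_entry(
    NarrativePrompt(
        fraud_vector="synthetic identity fraud",
        financial_instrument="business checking account",
        amount_usd=1250.0,
        community_context="immigrant small business owner",
    )
)
print(entry["prompt"])
```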

Practical Applications

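In practice, the dataset supports financial institutions and regulatory technology. By simulating ITIN-related fraud scenarios, it can be used to train detection systems to recognize threats such as identity theft and synthetic identity scams, improving protection for high-risk groups. Its multilingual narrative field also helps in building cross-cultural anti-fraud communication tools and in improving customer education materials and warning mechanisms. These uses support more inclusive financial services and can give policymakers a reference point for decisions.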

Derived and Related Work

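Work related to the dataset includes a Tab-DDPM-based synthetic data generation framework and chain-of-thought reasoning models. Its narrative structure has been used to develop fraud detection pipelines aimed at marginalized communities, such as an agentic RAG system integrated with the Adaptive Data platform. This line of work extends to fine-tuning multilingual financial language models, for example adding fraud reasoning capability to the FinBERT and Gemma architectures. Together these efforts form a cross-disciplinary thread connecting fair machine learning and the study of financial crime.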

The content above was collected and summarized by 遇见数据集.
