Nachammai41/itin-fraud-narratives
Source: Hugging Face | Updated: 2026-04-11 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/Nachammai41/itin-fraud-narratives
Resource description:
---
annotations_creators: []
language:
- en
language_creators: []
license: []
multilinguality:
- monolingual
pretty_name: 'itin_fraud_narratives'
size_categories:
- 1K<n<10K
source_datasets:
- 'original'
tags:
- adaption
- instruction-tuning
- writing-editing-communication
- legal
task_categories: []
task_ids: []
---

This dataset is a remastered version prepared using [Adaption's](https://adaptionlabs.ai/app/auth) Adaptive Data platform.
# itin_fraud_narratives
This dataset contains prompt templates designed to generate first-person narratives about financial fraud targeting ITIN holders. Each entry specifies variables such as fraud vector, financial instrument, transaction amount, and language to guide the creation of synthetic victim stories. The samples focus on scenarios involving identity theft, tax fraud, and synthetic identity crimes within immigrant communities. All provided completions in the sample are null, indicating this is a prompt-only collection for data generation tasks.
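
As a rough illustration of what a prompt-only entry might look like, the sketch below fills a template with the variables named above. The template wording, the field names (`fraud_vector`, `financial_instrument`, `amount_usd`, `language`), the `prompt`/`completion` keys, and the `build_prompt` helper are all assumptions made for illustration; the dataset's actual columns and phrasing may differ.

```python
def build_prompt(fraud_vector, financial_instrument, amount_usd, language):
    """Fill a hypothetical narrative prompt template (wording is illustrative)."""
    return (
        f"Write a first-person narrative in {language} from an ITIN holder "
        f"who lost ${amount_usd:,.2f} via {financial_instrument} "
        f"in a {fraud_vector} scheme."
    )


# A prompt-only entry: the completion is left empty, as in the sampled rows.
entry = {
    "prompt": build_prompt(
        fraud_vector="synthetic identity fraud",
        financial_instrument="prepaid card",
        amount_usd=1250,
        language="English",
    ),
    "completion": None,
}
print(entry["prompt"])
```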
### Dataset size
There are 3,317 data points in this dataset. This is an instruction tuning dataset.
### Quality of Remastered Dataset
The final quality is A, with a relative quality improvement of 88.0%.
### Domain
- Writing-editing-communication (92%)
- Legal (8%)
### Language
- English (100%)
### Tone
- Anecdotal (100%)
### Evaluation Results
- **Quality Gains:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/5e77a1cd-d5dc-43f1-831e-dba8fa5ccfd6.png" alt="QualityGains" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Grade Improvement:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/e929a7f0-5f3a-4bc0-a82d-a1c006fed992.png" alt="Grade" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
- **Percentile Chart:**
<img src="https://proteus-prod-public.s3.us-east-1.amazonaws.com/temp/9c118d30-632b-4669-9193-504fcf39d3bf.png" alt="Percentile Chart" style="max-width: 50%; display: block; margin-left: auto; margin-right: auto;" />
# Underserved Financial Fraud Dataset
### Synthetic fraud detection data for underrepresented communities
**Created with Adaptive Data by Adaption** | CC BY 4.0 | 5 languages
---
## What This Is
A synthetic financial fraud dataset covering **four underserved community archetypes** — populations that rely on remittance transfers, gig economy payouts, prepaid cards, and ITIN-based transactions. These communities are disproportionately targeted by fraud, yet no open-source fraud dataset has ever modeled their financial behavior.
This dataset fills that gap.
---
## The Four Archetypes
| Archetype | Who | Fraud Vectors | Languages |
|---|---|---|---|
| **Remittance Sender** | Immigrants sending money cross-border via Western Union, Remitly, MoneyGram | Emergency call scams, fake exchange rate bonuses, interception | es, ht, yo, hi, en |
| **Gig Worker** | Uber, DoorDash, Instacart workers paid via CashApp, Venmo | Account takeover, SIM swap, fake platform support calls | en, hi, vi, es, yo |
| **Unbanked Cash-In User** | Populations using prepaid cards and retail kiosks | Predatory micro-loans, load-fee scams, fake utility kiosks | en, es, vi, yo, hi |
| **ITIN Entrepreneur** | Immigrant small business owners with no SSN | Synthetic identity fraud, fake tax returns, mule accounts | en, es, hi, ta, vi |
## Languages
`en` English | `es` Spanish | `hi` Hinglish | `ht` Haitian Creole | `yo` Yoruba | `vi` Vietnamese | `ta` Tamil | `ta-en` Tamil-English
---
## What Makes It Different
**No existing fraud dataset covers this population.**
PaySim simulates generic mobile money. Sparkov models middle-class credit cards. IEEE-CIS captures e-commerce. None of them covers remittance kiosks, gig payouts, or ITIN-linked accounts.
**Generated with diffusion, not rules.**
The tabular data was generated with Tab-DDPM (denoising diffusion for tabular data), which learns joint correlations across behavioral features rather than sampling each column independently. The model was trained on an A100 GPU via Google Colab Pro.
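
The difference between joint and independent column sampling can be seen in a small toy example. The sketch below does not implement Tab-DDPM; it only shows, on made-up data, how resampling one column independently of another erases the correlation between them. The `hours` and `amounts` columns and their linear relationship are invented for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# Toy "seed rows": the amount depends on the hour, plus noise (made-up relation).
hours = [rng.uniform(0, 24) for _ in range(2000)]
amounts = [50 + 10 * h + rng.gauss(0, 20) for h in hours]

# Independent column sampling: same marginal for amounts, but drawn without
# regard to the hour column (simulated here by shuffling).
independent_amounts = amounts[:]
rng.shuffle(independent_amounts)

joint_corr = pearson(hours, amounts)
independent_corr = pearson(hours, independent_amounts)
print(f"joint: {joint_corr:.2f}, independent: {independent_corr:.2f}")
```

A generator that models the joint distribution aims to preserve the first correlation; one that samples columns independently behaves like the second.
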
**Multilingual narrative text.**
Every fraud transaction has a `narrative_text` field — the scam message or fraud description in the community's language. Generated by Adaptive Data by Adaption. Quality score improved from E (5.0) to A (9.2–9.4).
**Reasoning traces.**
390 chain-of-thought fraud analysis examples — step-by-step investigator reasoning grounded in community-specific fraud signals. No existing fraud dataset includes this. Built for fine-tuning financial language models (FinBERT, Gemma).
---
## How It Was Built
```
1. Scrape    1,040 real fraud narratives from CFPB, BBB Scam Tracker,
             and Reddit archive (Pullpush.io)
2. Profile   Behavioral distributions per archetype derived from
             scraped narratives: amounts, channels, corridors,
             fraud vectors, language mix
3. Generate  Tab-DDPM trains on 5,000 seed rows per archetype,
             learns joint feature correlations, generates 5,000
             synthetic transactions per archetype
4. Narrate   Adaptive Data by Adaption fills narrative_text in
             8 languages per transaction's fraud context
5. Trace     390 reasoning traces generated: chain-of-thought
             fraud analysis for fine-tuning use
```
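
The profiling step (step 2) can be pictured as computing, for each archetype, the empirical distribution of a categorical field across the scraped narratives. The sketch below is a minimal version under that assumption; the `profile` helper, the record layout, and the example `fraud_vector` values are hypothetical and are not taken from the actual pipeline.

```python
from collections import Counter, defaultdict


def profile(records, field):
    """Per-archetype relative frequencies of a categorical field."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["archetype"]][record[field]] += 1
    return {
        archetype: {
            value: n / sum(counter.values()) for value, n in counter.items()
        }
        for archetype, counter in counts.items()
    }


# Toy stand-ins for labeled scraped narratives (values are illustrative).
records = [
    {"archetype": "itin", "fraud_vector": "fake_tax_return"},
    {"archetype": "itin", "fraud_vector": "fake_tax_return"},
    {"archetype": "itin", "fraud_vector": "mule_account"},
    {"archetype": "gig_worker", "fraud_vector": "sim_swap"},
]

distributions = profile(records, "fraud_vector")
print(distributions)
```

Distributions like these could then be used to draw the seed rows that the tabular generator trains on in step 3.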
---
## Schema (Key Fields)
| Field | Type | Description |
|---|---|---|
| `transaction_id` | uuid | Unique identifier |
| `archetype` | categorical | remittance / gig_worker / unbanked / itin |
| `amount_usd` | float | Transaction amount |
| `channel` | categorical | retail_kiosk / mobile_app / p2p / bank_wire |
| `fraud_vector` | categorical | Specific scam type |
| `is_fraud` | bool | Ground truth label |
| `fraud_confidence` | float | 0.0–1.0 label confidence |
| `narrative_text` | string | Scam description in community language |
| `narrative_language` | categorical | ISO 639-1 language code |
| `reasoning_trace` | string | Chain-of-thought fraud analysis (sampled rows) |
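
For readers who want to validate rows against the key fields above, the following is a minimal sketch of a typed record. The field names and category values come from the schema table; the `Transaction` class, its checks, and the example values (including the `fraud_vector` string) are assumptions for illustration and are not shipped with the dataset.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

# Category values copied from the schema table.
ARCHETYPES = {"remittance", "gig_worker", "unbanked", "itin"}
CHANNELS = {"retail_kiosk", "mobile_app", "p2p", "bank_wire"}


@dataclass
class Transaction:
    transaction_id: str
    archetype: str
    amount_usd: float
    channel: str
    fraud_vector: str
    is_fraud: bool
    fraud_confidence: float
    narrative_text: str
    narrative_language: str
    reasoning_trace: Optional[str] = None  # only present on sampled rows

    def __post_init__(self):
        uuid.UUID(self.transaction_id)  # raises ValueError on a malformed UUID
        if self.archetype not in ARCHETYPES:
            raise ValueError(f"unknown archetype: {self.archetype!r}")
        if self.channel not in CHANNELS:
            raise ValueError(f"unknown channel: {self.channel!r}")
        if not 0.0 <= self.fraud_confidence <= 1.0:
            raise ValueError("fraud_confidence must be in [0.0, 1.0]")


row = Transaction(
    transaction_id=str(uuid.uuid4()),
    archetype="itin",
    amount_usd=1250.0,
    channel="bank_wire",
    fraud_vector="fake_tax_return",  # illustrative value
    is_fraud=True,
    fraud_confidence=0.93,
    narrative_text="(scam description goes here)",
    narrative_language="en",
)
```
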
## Intended Use
- Training fraud detection models on underserved community transaction patterns
- Benchmarking existing models (IEEE-CIS trained) against this population
- Fine-tuning financial language models on multilingual fraud narratives
- Research into AI fairness and financial inclusion
- NLP research on under-resourced financial language
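
The benchmarking use case above could be approached by scoring an existing model's predictions separately for each archetype. The sketch below computes per-archetype recall on fraud rows; the `recall_by_archetype` helper and the toy rows and predictions are hypothetical, and only the `archetype` and `is_fraud` field names come from the schema.

```python
from collections import defaultdict


def recall_by_archetype(rows, predictions):
    """Fraction of true fraud rows flagged by the model, per archetype."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for row, predicted_fraud in zip(rows, predictions):
        if row["is_fraud"]:
            total[row["archetype"]] += 1
            caught[row["archetype"]] += int(predicted_fraud)
    return {archetype: caught[archetype] / total[archetype] for archetype in total}


# Toy labels and model outputs (made up for illustration).
rows = [
    {"archetype": "remittance", "is_fraud": True},
    {"archetype": "remittance", "is_fraud": True},
    {"archetype": "gig_worker", "is_fraud": True},
    {"archetype": "gig_worker", "is_fraud": False},
]
predictions = [True, False, True, True]

recall = recall_by_archetype(rows, predictions)
print(recall)
```

Reporting the metric per archetype, rather than as one pooled number, makes it visible when a model trained on mainstream data misses fraud in a specific community.
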
---
## What This Is Not
This is a **fully synthetic** dataset. No real transaction data. No PII. Behavioral distributions are informed by public fraud narratives and World Bank remittance corridor data — not empirically measured transaction logs. Like all synthetic fraud datasets (PaySim, Sparkov, Cifer-AF), ground truth validation against real data is not possible due to privacy constraints.
---
## Origin
This dataset was created as part of the **Uncharted Data Challenge** by Adaption Labs (April 2026). It extends the [Fraud Detection Framework](https://github.com/nachammai779/Fraud-Detection-Framework---An-Agentic-RAG-Pipeline-with-Custom-Financial-SLM) — an Agentic RAG pipeline with a custom Financial SLM built on the IEEE-CIS dataset (AUC-ROC 0.9486). The underserved dataset enables direct benchmarking: how does a model trained on mainstream data perform on populations it has never seen?
---
## Citation
```bibtex
@dataset{palaniappan2026underserved,
author = {Palaniappan, Nachammai},
title = {Underserved Financial Fraud Dataset},
year = {2026},
publisher = {HuggingFace},
note = {Created with Adaptive Data by Adaption.
Uncharted Data Challenge, Adaption Labs.},
url = {https://huggingface.co/datasets/nachammai779/underserved-financial-fraud}
}
```
---
## Credits
- **Adaptive Data by Adaption** — Narrative generation and dataset enrichment
- **Tab-DDPM** (Kotelnikov et al., 2022) — Tabular diffusion model
- **CFPB** — Consumer Financial Protection Bureau public complaint database
- **BBB Scam Tracker** — Better Business Bureau public scam reports
- **Pullpush.io** — Reddit archive API
---
*License: CC BY 4.0 — Free to use with attribution*
Provider:
Nachammai41
Collected summary
Dataset introduction

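Construction method
In financial fraud detection, fraud targeting underserved groups such as ITIN holders often lacks representative data. This dataset was built with a synthetic data generation approach. First, 1,040 real fraud narratives were collected from public sources such as CFPB and BBB Scam Tracker as seed data. Tab-DDPM (a tabular denoising diffusion probabilistic model) was then used to learn the joint distribution of the fraud features and to generate synthetic transaction records covering four underserved community archetypes. The narrative text for each data point was filled in through Adaption's Adaptive Data platform, producing multilingual fraud descriptions, and 390 chain-of-thought reasoning traces were added to simulate an investigator's step-by-step analysis. The result is a realistic fraud-scenario dataset built without exposing private data.
Features
The dataset's core feature is its focus on underserved groups that traditional fraud detection datasets often overlook, such as immigrant communities that rely on remittances, gig economy income, or ITIN-based transactions. It contains 3,317 prompt templates designed to generate first-person fraud victim narratives, covering fraud vectors such as identity theft and tax fraud. The narrative text is anecdotal in style and supports eight languages, including English and Spanish, which adds to the diversity and realism of the data. After remastering, the dataset received a quality grade of A, a relative quality improvement of 88%, which supports its usability for language generation tasks.
Usage
The dataset is mainly used for instruction tuning, in particular to give generative language models structured prompts for synthesizing financial fraud narratives about ITIN holders. Researchers can combine the prompt templates with specified variables such as fraud vector, financial instrument, and transaction amount to generate realistic victim stories for training or evaluating fraud detection models. More broadly, the dataset supports multilingual natural language processing research, helps improve the performance of financial language models in underserved-community contexts, and provides a data resource for research on AI fairness and financial inclusion.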
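Background and challenges
Background overview
At the intersection of fintech and artificial intelligence, data on financial fraud targeting underserved groups has long been scarce. The itin-fraud-narratives dataset was created in 2026 as part of the Uncharted Data Challenge by Adaption Labs to fill this gap. It focuses on immigrants who hold an Individual Taxpayer Identification Number (ITIN) and uses generative AI to build synthetic narratives that simulate crimes such as identity theft and tax fraud. Its core research question is how to give machine learning models high-quality, diverse training data so that they can better detect financial fraud in specific communities, advancing research on financial inclusion and AI fairness.
Current challenges
The dataset addresses the underrepresentation of underserved groups in financial fraud detection. Traditional fraud detection models are usually trained on mainstream transaction data and struggle to recognize the scam patterns that target immigrant communities. Building the dataset involved several technical obstacles. First, real fraud narratives are scarce and scattered, so they had to be scraped from multiple public sources and consolidated. Second, the synthetic data had to follow the joint distribution of behavioral features seen in the real world rather than simple independent sampling. Third, generating multilingual narrative text requires the model to understand fraud contexts across cultural backgrounds so that the narratives stay realistic and consistent. Together, these challenges reflect the difficulty of balancing realism, diversity, and ethical compliance in synthetic data.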
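Common scenarios
Classic use cases
In financial fraud detection, the itin-fraud-narratives dataset gives researchers and developers synthetic fraud narrative templates focused on ITIN holders. By specifying variables such as fraud vector, financial instrument, and transaction amount, the templates can generate first-person victim stories that simulate concrete scenarios such as identity theft and tax fraud. The dataset is mainly used for instruction tuning, providing language models with high-quality financial fraud text about immigrant communities to support comprehension and generation in that context.
Practical applications
In practice, the itin-fraud-narratives dataset can be used to train and evaluate fraud detection models, especially for underserved communities. Financial institutions and technology companies can use the synthetic data to improve their systems' recognition of ITIN-related fraud patterns and so make risk management more precise. The dataset also supports multilingual narrative generation, which helps in developing financial safety education tools and customer support systems for immigrant populations worldwide, promoting financial inclusion and security.
Derived and related work
The dataset has given rise to several lines of work, particularly within the framework of the extended Underserved Financial Fraud Dataset. Related work uses diffusion models such as Tab-DDPM to generate synthetic transaction data and combines it with chain-of-thought reasoning traces to fine-tune financial language models such as FinBERT and Gemma. This work advances fairness benchmarking for fraud detection algorithms and provides a data foundation for building agentic RAG pipelines that serve marginalized communities.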
The content above was collected and summarized by 遇见数据集.



