vanila434/multilingual-elder-safety-msgs
Hugging Face · Updated 2026-04-19 · Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/vanila434/multilingual-elder-safety-msgs
Description:
---
language:
- zh
- yue
- en
- vi
- km
- lo
license: cc-by-4.0
task_categories:
- text-classification
- question-answering
tags:
- fraud-detection
- elder-safety
- chinese-american
- multilingual
- southeast-asian
- contrastive-pairs
- cultural-levers
- adaption-labs
- uncharted-data-challenge
- localization
pretty_name: Multilingual Elder Safety Messages
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: multilingual_elder_safety_msgs.csv
  - split: holdout
    path: holdout_rows.csv
---
# multilingual-elder-safety-msgs
A hand-authored, multilingual elder fraud-recognition and safety coaching dataset. **467 curated scam/safe scenarios** in Chinese and English, with platform-generated coaching responses localized across **5 languages**: Chinese, English, Vietnamese, Khmer (Cambodian), and Lao. Expanded to **1,029 rows** through Adaption Labs platform reasoning traces and multilingual adaptation.
Built for communities where filial piety, authority deference, and fear of home-country authorities are systematically exploited by scammers — spanning Chinese-American, Vietnamese, Cambodian, and Laotian diaspora elders.
**Adaption Labs Uncharted Data Challenge submission.**
| Metric | Value |
|---|---|
| Input rows (scenarios) | 467 |
| Output rows (after adaptation) | 1,029 |
| Scenario languages | Mandarin, Cantonese, English + code-mixed |
| Coaching response languages | Chinese, English, Vietnamese, Khmer, Lao |
| Score before | 10.0 (A) |
| Score after | **9.5 (A)** |
| Completion quality | 9.04 → 9.63 (+6.5%), floor raised min 2 → 8 |
| Message quality | 9.29 → 9.39 (+1.1%), 78.4th percentile |
| Percentile | 57.7 |
| License | CC-BY-4.0 |
---
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("vanila434/multilingual-elder-safety-msgs", split="train")
# Filter by label
scams = ds.filter(lambda r: r["label"] == "scam")
safe = ds.filter(lambda r: r["label"] == "safe")
# Filter by input language
cantonese = ds.filter(lambda r: r["language"] == "cantonese")
# Access a row
row = ds[0]
print(row["text"]) # Original scam/safe message (Chinese/English)
print(row["enhanced_completion"]) # Coaching response (may be Vietnamese, Khmer, Lao, etc.)
print(row["reasoning_trace"]) # Chain-of-thought analysis
print(row["label"]) # scam / safe / ambiguous
print(row["cultural_lever"]) # filial_piety, authority_deference, etc.
```
---
## 1. Motivation
### Why this dataset exists
Elder fraud targeting Asian diaspora communities is a large, growing, and systematically under-addressed problem. The cultural levers that make these scams effective -- authority deference, filial piety, fear of home-country authorities, health anxiety, collectivist pressure -- are absent from every existing English-language fraud detection dataset.
These cultural levers are not unique to Chinese-Americans. They operate across Vietnamese, Cambodian, Laotian, and Malaysian diaspora communities with similar intensity. A grandmother in Ho Chi Minh City and a grandmother in San Francisco Chinatown are exploited by the same psychological mechanisms, adapted to different institutional contexts.
This dataset exists to train and evaluate models that coach the adult children of Asian elders on whether a message their parent received is a scam, legitimate, or ambiguous -- and what to do next -- in their community's language.
### The scale of the problem
- **$4.885 billion** in reported losses by victims aged 60+ in 2024 -- a 43% year-over-year increase (FBI IC3 2024)
- **Chinese authority impersonation:** >$40 million over 14 months, 350+ complaints, average $164,000 per victim (IC3 PSA 190328)
- **SFPD Chinatown blessing scams:** 7 cases, ~$374,000 stolen, all elders (SFPD bulletin 25-004)
- **Vietnam:** 16,000+ online fraud cases reported in 2023; elder-targeted "grandson in trouble" and fake bank calls rising (Ministry of Public Security)
- **Cambodia/Laos:** Cross-border scam call centers documented by INTERPOL and UNODC targeting diaspora communities across Southeast Asia
### Who funded or initiated this work
Created by Vania with contributions from friends and family for the Adaption Labs Uncharted Data Challenge. No external funding. The author has personal proximity to the Chinese-American target community. Southeast Asian localization produced via the Adaption platform's multilingual adaptation features.
---
## 2. Composition
### What makes this dataset different
1. **Multilingual coaching outputs.** Same fraud scenarios answered in 5 languages — Chinese, English, Vietnamese, Khmer, Lao — via platform adaptation. Enables cross-lingual safety deployment.
2. **Cultural lever annotation.** The `cultural_lever` column (filial piety, authority deference, collectivist pressure, fear of home-country authorities, health anxiety) is absent from all prior public fraud datasets.
3. **Contrastive pair structure.** Scam/legitimate pairs share surface features but flip the recommended action (Kaushik et al., ICLR 2020).
4. **Metadata-driven answer-change.** Every metadata column demonstrably changes the correct response.
5. **Platform-expanded at scale.** 467 curated scenarios → 1,029 rows with reasoning traces, localized coaching, and quality floor guarantees.
### How the multilingual design works
| Layer | Languages | Source |
|---|---|---|
| **Input scenarios** (`text` column) | Mandarin, Cantonese, English, code-mixed | Hand-authored |
| **Coaching responses** (`enhanced_completion`) | Chinese, English, Vietnamese, Khmer, Lao | Platform-adapted |
| **Reasoning traces** (`reasoning_trace`) | Multilingual | Platform-generated |
The platform took each scenario and produced coaching responses appropriate for different target audiences. A Cantonese grandparent scam gets coaching responses in Vietnamese for a Vietnamese family, in Khmer for a Cambodian family, etc. — each with localized hotline numbers, institution names, and cultural framing.
### Enhanced completion language distribution
| Response language | Rows | % |
|---|---|---|
| English | 244 | 24% |
| Chinese | 239 | 23% |
| Khmer | 189 | 18% |
| Lao | 189 | 18% |
| Vietnamese | 168 | 16% |
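
The schema below lists no column that records the language of `enhanced_completion`, so a breakdown like this one has to be derived from the text itself. The following is a minimal sketch under the assumption that a Unicode-script heuristic is good enough. The function name `guess_response_language` and the diacritic rule for Vietnamese are illustrative choices, not the method used to produce the table.

```python
import unicodedata
from collections import Counter

# Unicode block ranges for the non-Latin scripts named in this card.
SCRIPT_RANGES = {
    "khmer": (0x1780, 0x17FF),
    "lao": (0x0E80, 0x0EFF),
    "chinese": (0x4E00, 0x9FFF),  # CJK Unified Ideographs, basic block only
}

def guess_response_language(text: str) -> str:
    """Heuristic guess at the language of a coaching response (assumption)."""
    hits = Counter()
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                hits[name] += 1
    if hits:
        # The script with the most characters wins.
        return hits.most_common(1)[0][0]
    # Latin-script text: Vietnamese carries tone/vowel diacritics or the letter "đ".
    decomposed = unicodedata.normalize("NFD", text)
    has_marks = any(unicodedata.combining(ch) for ch in decomposed)
    if has_marks or "đ" in text.lower():
        return "vietnamese"
    return "english"

# Snippets taken from the data instances shown in this card.
samples = {
    "vietnamese": "Đây là một vụ lừa đảo mạo danh cảnh sát điển hình",
    "khmer": "សារនេះគឺជាសារក្រុមគ្រួសារធម្មតា",
    "lao": "ນີ້ແມ່ນການຫຼອກລວງແບບ IRS Impersonation Scam",
    "chinese": "呢度係香港警察總部",
    "english": "This is Officer Chen from the IRS",
}
guesses = {expected: guess_response_language(text) for expected, text in samples.items()}
print(guesses)
```

The heuristic cannot separate Mandarin from Cantonese, and English text containing accented loanwords would be misread as Vietnamese, so any counts it produces should be spot-checked.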
### Schema (16 columns + 3 platform-added)
| # | Column | Type | Role |
|---|---|---|---|
| 1 | `id` | `cdef-NNN` | Stable identifier |
| 2 | `text` | Free text | The scam/safe message in original language |
| 3 | `text_english_gloss` | Free text | English translation for evaluators |
| 4 | `label` | `scam` / `safe` / `ambiguous` | Ground truth classification |
| 5 | `category` | `fraud` / `legitimate_communication` / `edge_case` | Coarse class |
| 6 | `sub_type` | 38+ values | Fine-grained fraud/legitimate type |
| 7 | `language` | `mandarin` / `cantonese` / `english` / `code_mixed_*` | Input message language |
| 8 | `script` | `simplified` / `traditional` / `romanized` / `mixed` | Input writing system |
| 9 | `channel` | `phone_call` / `sms` / `wechat_1to1` / `wechat_group` / `mail` / `in_person` | Communication channel |
| 10 | `target_demographic` | `elder_immigrant` / `adult_child_caregiver` / `cross_demo_edge` | Who receives the message |
| 11 | `cultural_lever` | `filial_piety` / `authority_deference` / `collectivist_pressure` / `fear_of_home_country_authorities` / `health_anxiety` / `none` | Sociocultural hook exploited |
| 12 | `reasoning_tag` | e.g. `urgency_manipulation`, `isolation_tactic`, `legitimate_process` | Why this is/isn't fraud |
| 13 | `context_type` | `original` / `contrast` / `negative_space` / `edge_case` | Structural role |
| 14 | `recommended_action` | e.g. `hang_up_and_verify`, `verify_via_official_channel` | Action tag |
| 15 | `completion` | Free text | Hand-authored coaching response |
| 16 | `rejected_completion` | Free text (subset) | Bad response for DPO training |
| 17 | `enhanced_prompt` | Free text | Platform: scenario + context inlined |
| 18 | `enhanced_completion` | Free text | Platform: multilingual coaching response |
| 19 | `reasoning_trace` | Free text | Platform: chain-of-thought analysis |
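
The closed value sets in the schema table can be turned into a quick row validator. This is a sketch only: the helper `validate_row` is hypothetical, and whether every row in the released CSV conforms to these sets is not verified here. For instance, the target-demographic distribution further down lists a `caregiver` value that the schema does not document, and a check like this would flag it.

```python
# Allowed values copied from the schema table; open-ended columns are skipped.
ALLOWED = {
    "label": {"scam", "safe", "ambiguous"},
    "category": {"fraud", "legitimate_communication", "edge_case"},
    "script": {"simplified", "traditional", "romanized", "mixed"},
    "channel": {"phone_call", "sms", "wechat_1to1", "wechat_group", "mail", "in_person"},
    "target_demographic": {"elder_immigrant", "adult_child_caregiver", "cross_demo_edge"},
    "cultural_lever": {
        "filial_piety", "authority_deference", "collectivist_pressure",
        "fear_of_home_country_authorities", "health_anxiety", "none",
    },
    "context_type": {"original", "contrast", "negative_space", "edge_case"},
}
BASE_LANGUAGES = {"mandarin", "cantonese", "english"}

def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row passes."""
    problems = []
    for column, allowed in ALLOWED.items():
        value = row.get(column)
        if value not in allowed:
            problems.append(f"{column}={value!r} is not a documented value")
    # `language` allows any code_mixed_* variant in addition to the base languages.
    language = row.get("language") or ""
    if language not in BASE_LANGUAGES and not language.startswith("code_mixed_"):
        problems.append(f"language={language!r} is not a documented value")
    return problems

# Toy rows for illustration; they are not taken from the dataset.
good_row = {
    "label": "scam", "category": "fraud", "script": "traditional",
    "channel": "phone_call", "target_demographic": "elder_immigrant",
    "cultural_lever": "authority_deference", "context_type": "original",
    "language": "cantonese",
}
odd_row = dict(good_row, target_demographic="caregiver")

print(validate_row(good_row))
print(validate_row(odd_row))
```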
### Input scenario distributions
**Label:**
| Label | Count | % |
|---|---|---|
| safe | 579 | 56% |
| scam | 408 | 40% |
| ambiguous | 42 | 4% |
**Input language:**
| Language | Count | % |
|---|---|---|
| Mandarin | 470 | 46% |
| Cantonese | 325 | 32% |
| English | 166 | 16% |
| Code-mixed | 68 | 6% |
**Target demographic:**
| Demographic | Count | % |
|---|---|---|
| elder_immigrant | 760 | 74% |
| adult_child_caregiver | 211 | 20% |
| cross_demo_edge | 43 | 4% |
| caregiver | 15 | 2% |
**Cultural lever:**
| Lever | Count | Cross-community relevance |
|---|---|---|
| filial_piety | 422 | Chinese, Vietnamese, Cambodian, Laotian |
| authority_deference | 285 | Universal (police/bank impersonation) |
| collectivist_pressure | 136 | Chinese (investment groups), Vietnamese (family obligation) |
| fear_of_home_country_authorities | 105 | Chinese (PRC police), Vietnamese (Cong An) |
| none | 57 | Legitimate messages |
| health_anxiety | 24 | Universal (elder health targeting) |
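
As a consistency check, the counts in the distribution tables of this card can be summed and compared with the 1,029-row total. The sketch below uses only the numbers printed in those tables; the helper name `shares` is an illustrative choice.

```python
TOTAL_ROWS = 1029  # output rows reported in this card

# Counts copied from the distribution tables in this card.
distributions = {
    "response_language": {"english": 244, "chinese": 239, "khmer": 189, "lao": 189, "vietnamese": 168},
    "label": {"safe": 579, "scam": 408, "ambiguous": 42},
    "input_language": {"mandarin": 470, "cantonese": 325, "english": 166, "code_mixed": 68},
    "target_demographic": {
        "elder_immigrant": 760, "adult_child_caregiver": 211,
        "cross_demo_edge": 43, "caregiver": 15,
    },
    "cultural_lever": {
        "filial_piety": 422, "authority_deference": 285, "collectivist_pressure": 136,
        "fear_of_home_country_authorities": 105, "none": 57, "health_anxiety": 24,
    },
}

def shares(counts: dict) -> dict:
    """Percentage share of each value, rounded to one decimal place."""
    total = sum(counts.values())
    return {value: round(100 * n / total, 1) for value, n in counts.items()}

for name, counts in distributions.items():
    # Every table should account for every row exactly once.
    assert sum(counts.values()) == TOTAL_ROWS, name
    print(name, shares(counts))
```

One-decimal shares can differ slightly from the whole-number percentages printed in the tables because of rounding.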
### Data instances
**Example: scam scenario with Vietnamese coaching response**
```
text: 呢度係香港警察總部。你個身份證被人用嚟洗黑錢...唔好同屋企人講。
label: scam
cultural_lever: authority_deference
enhanced_completion: ## Phân tích thủ đoạn lừa đảo
Đây là một vụ lừa đảo mạo danh cảnh sát điển hình...
Kẻ gian sử dụng chiêu thức "tài khoản an toàn" — cảnh sát thật
KHÔNG BAO GIỜ yêu cầu chuyển tiền qua điện thoại...
Gọi ngay đường dây nóng chống lừa đảo: 18222
```
**Example: safe scenario with Khmer coaching response**
```
text: 嫲嫲今晚早啲瞓,我同媽咪明早9點過嚟接您去睇醫生...
label: safe
cultural_lever: filial_piety
enhanced_completion: យាយស្រលាញ់របស់ចៅ 🙏❤️
ចៅ Kevin ផ្ញើសារមកប្រាប់យាយថា...
សារនេះគឺជាសារក្រុមគ្រួសារធម្មតា មិនមានសញ្ញាបោកប្រាស់ទេ...
```
**Example: scam with Lao coaching response**
```
text: This is Officer Chen from the IRS Criminal Investigation Division...
label: scam
cultural_lever: authority_deference
enhanced_completion: ຈາກການວິເຄາະບົດບັນທຶກການໂທລະສັບດັ່ງກ່າວ...
ນີ້ແມ່ນການຫຼອກລວງແບບ IRS Impersonation Scam...
ໂທຫາ IRS ໂດຍກົງ: 1-800-829-1040
```
---
## 3. Collection Process
### Core authoring (467 scenarios)
All core scenarios were written by hand by the lead author and contributors with proximity to the Chinese-American community. No victim data was used. Scam scripts are synthetic reconstructions grounded in published law enforcement reports (FBI IC3, FCC, SFPD). English glosses were machine-translated and spot-checked.
### Multilingual adaptation (platform-generated)
The Adaption platform's adaptation features produced coaching responses in Vietnamese, Khmer, and Lao. The localization includes:
- **Localized institution references**: Vietnamese responses mention Cong An (police), Zalo (messaging platform); Khmer/Lao responses adapt to local context
- **Localized hotlines and verification channels**: Country-appropriate reporting numbers
- **Culturally-appropriate framing**: Filial piety manifests differently across Confucian and Theravada Buddhist contexts; responses adapt accordingly
- **Platform references**: Zalo (Vietnam), WhatsApp, Shopee adapted into coaching responses
### Blueprint iteration (9 runs)
| Version | Result | Lesson |
|---|---|---|
| v1 | 7.0 → 6.3 (C), -10% | Scam-bias — "name the lever being exploited" presupposed fraud |
| v2 | 7.0 → 8.5 (B), +21% | Classification-first framing fixed misclassification |
| v3 | 8.0 → 8.0 (B), 0% | Over-constrained; 5 simultaneous changes regressed quality |
| No blueprint | 8.0 → 9.4 (A), +18% | Platform's default beat all hand-crafted blueprints |
| v3 + fixes | 9.0 → 9.4 (A), +4% | Completion quality fixes raised floor |
| **Final (submission)** | **10.0 → 9.5 (A)** | **Multilingual adaptation; min=2 → 8** |
### Completion quality fixes (pre-submission)
Analysis of run_008 identified 26 short `enhanced_completion` values (<200 chars). Two groups, 21 rows in total, were fixed:
- **3 broken rows**: The platform echoed metadata labels instead of generating responses. These were rewritten with full coaching (1,665-1,866 chars).
- **18 short Chinese rows**: Added `【判定:合法訊息】` ("Verdict: legitimate message") blocks with verification signal analysis.
The fixes were merged before the final submission run.
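
A length screen of this kind is easy to reproduce. The sketch below assumes rows are dictionaries with the documented `id` and `enhanced_completion` columns; the helper `find_short_completions` and the toy rows are hypothetical, and the 200-character cutoff is the one quoted above.

```python
SHORT_THRESHOLD = 200  # characters, the cutoff quoted in this section

def find_short_completions(rows, threshold=SHORT_THRESHOLD):
    """Return the ids of rows whose enhanced_completion is shorter than the threshold."""
    short_ids = []
    for row in rows:
        # Treat a missing or None completion as an empty string.
        completion = (row.get("enhanced_completion") or "").strip()
        if len(completion) < threshold:
            short_ids.append(row["id"])
    return short_ids

# Toy rows for illustration; they are not taken from the dataset.
rows = [
    {"id": "cdef-001", "enhanced_completion": "label: safe"},   # echoed metadata, too short
    {"id": "cdef-002", "enhanced_completion": "x" * 1700},      # full-length response
    {"id": "cdef-003", "enhanced_completion": None},            # missing response
]
print(find_short_completions(rows))
```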
### Adaption platform configuration
| Parameter | Value |
|---|---|
| Dataset name | `multilingual_elder_safety_msgs` |
| Dataset ID | `c8599996-5188-4eec-856d-2221f37aaafa` |
| Recipes | `reasoning_traces` + `prompt_rephrase` + `deduplication` |
| `hallucination_mitigation` | ON |
| `length` | `detailed` |
| Blueprint | v3 (classification-first + anti-reframing + multilingual guidance) |
| Input rows | 467 |
| Output rows | 1,029 |
### Platform evaluation
| Dimension | Before | After | Gain | Percentile |
|---|---|---|---|---|
| **Message quality** | 9.29 | 9.39 | +1.1% | 78.4th |
| **Completion quality** | 9.04 | 9.63 | +6.5% | 37.0th |
| **Completion floor** | min=2 | min=8 | — | — |
| **Completion consistency** | std=1.83 | std=0.60 | — | — |
| **Overall** | 10.0 (A) | 9.5 (A) | — | 57.7th |
**Cost transparency — full run history:**
| Run | Date | Rows | Score | Notes |
|---|---|---|---|---|
| run_001 | 2026-04-16 | 3 | 7.0 → 6.3 (C) | Blueprint v1 failure |
| run_002 | 2026-04-17 | 203 | 7.0 → 8.5 (B) | Blueprint v2 fix |
| run_003 | 2026-04-17 | 203 | 8.0 → 8.0 (B) | Over-constrained — reverted |
| run_004 | 2026-04-17 | 207 | 8.0 → 9.4 (A) | No blueprint — B → A breakthrough |
| run_005 | 2026-04-17 | 217 | 7.0 → 9.2 (A) | Updated completions; min=1 → 6 |
| run_006 | 2026-04-17 | 245 | 7.0 → 9.1 (A) | Expanded rows |
| run_007 | 2026-04-17 | 245 | 7.0 → 8.9 (B) | Regression |
| run_008 | 2026-04-19 | 235 | 9.0 → 9.4 (A) | Best input quality, 57.7th pct |
| **run_009** | **2026-04-19** | **467** | **10.0 → 9.5 (A)** | **Multilingual; completion floor 2 → 8** |
---
## 4. Preprocessing, Cleaning, and Labeling
### Audit framework
Audits ran at rows 15, 35, 40, 80, 112, and 235 using 8 automated questions:
1. Metadata operativity — does each column flip `recommended_action`?
2. Sub_type coverage — all present, none over-represented?
3. Contrast pair integrity — every fraud row paired with legitimate twin?
4. Language balance — within 5pp of target?
5. Demographic balance — within 5pp of 70/20/10?
6. Completion quality — natural sentences, not tags?
7. Cultural lever distribution — each lever sufficient?
8. False positive risk — surface-similar safe rows exist?
All audits passed at final row count (235 core, 467 with expansion).
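
Checks 4 and 5 above compare observed shares with targets inside a 5-percentage-point band. The sketch below shows one way to express that rule. It is not the contents of `audit.py`; the function names, the mapping of 70/20/10 onto the three demographic values, and the counts are all assumptions made for illustration.

```python
def balance_gaps(counts: dict, targets: dict) -> dict:
    """Observed share minus target share, in percentage points, per category."""
    total = sum(counts.values())
    return {k: 100 * counts.get(k, 0) / total - target for k, target in targets.items()}

def within_tolerance(counts: dict, targets: dict, tol_pp: float = 5.0) -> bool:
    """True when every category is within tol_pp percentage points of its target."""
    return all(abs(gap) <= tol_pp for gap in balance_gaps(counts, targets).values())

# Assumed mapping of the 70/20/10 target onto the documented demographic values.
targets = {"elder_immigrant": 70.0, "adult_child_caregiver": 20.0, "cross_demo_edge": 10.0}

# Hypothetical counts, not the real audit inputs.
balanced = {"elder_immigrant": 168, "adult_child_caregiver": 45, "cross_demo_edge": 22}
skewed = {"elder_immigrant": 200, "adult_child_caregiver": 30, "cross_demo_edge": 5}

for name, counts in [("balanced", balanced), ("skewed", skewed)]:
    gaps = {k: round(v, 1) for k, v in balance_gaps(counts, targets).items()}
    print(name, within_tolerance(counts, targets), gaps)
```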
### DEITA diversity audit
Pairwise cosine similarity on `text` column using `paraphrase-multilingual-MiniLM-L12-v2`:
| Metric | Value |
|---|---|
| Mean pairwise similarity | 0.335 |
| Pairs above 0.85 threshold | 2 (reviewed, retained — intentional contrast pairs) |
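
The two figures in this table are a mean over all row pairs and a count of pairs above a threshold. A minimal sketch of that computation follows, using toy vectors in place of real sentence embeddings so it runs without downloading the model. The function names are hypothetical, and in practice the vectors would come from encoding the `text` column with the model named above.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def diversity_report(embeddings, threshold=0.85):
    """Return (mean pairwise similarity, list of index pairs above the threshold)."""
    sims = [
        (i, j, cosine(embeddings[i], embeddings[j]))
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    mean_similarity = sum(s for _, _, s in sims) / len(sims)
    flagged = [(i, j) for i, j, s in sims if s > threshold]
    return mean_similarity, flagged

# Toy 3-dimensional "embeddings"; rows 0 and 1 are near-duplicates.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
mean_similarity, flagged = diversity_report(toy_embeddings)
print(round(mean_similarity, 3), flagged)
```

Flagged pairs still need manual review, since intentional contrast pairs are expected to score high.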
### Per-column operativity proof
| Column | Pair example | What flips |
|---|---|---|
| `language` | Mandarin vs Cantonese police impersonation | Different hotline numbers |
| `channel` | `phone_call` vs `wechat_1to1` | Hang up vs screenshot + block |
| `target_demographic` | `elder_immigrant` vs `adult_child_caregiver` | Coach vs act directly |
| `cultural_lever` | `authority_deference` vs `filial_piety` | Different de-escalation |
| `context_type` | `original` scam vs `contrast` twin | Opposite action |
| `script` | Simplified vs traditional | Script mismatch = fraud signal |
### Reasoning tag as Orca-2 reasoning-strategy analog
The `reasoning_tag` column is not just metadata for slicing — it is a **supervision signal for which reasoning strategy** applies to each row (following Orca-2's reasoning technique selection; Mitra et al., 2023). Examples:
| reasoning_tag | Strategy required |
|---|---|
| `legitimate_process,verified_source` | Verify credentials against known sources |
| `authority_impersonation,urgency_manipulation` | Decompose authority claim, check for time pressure |
| `guaranteed_returns,sunk_cost_escalation` | Financial logic check, escalation pattern recognition |
| `isolation_tactic` | Flag secrecy demand as cardinal fraud signal |
### Contrastive pair methodology
The contrast-twin structure implements **counterfactually-augmented data** (Kaushik et al., ICLR 2020) and **contrast sets** (Gardner et al., EMNLP 2020). Each scam row is paired with a legitimate twin sharing surface features but flipping the causal feature:
- Same institutional name (e.g., "Hong Kong Police")
- Same language and register
- Same channel
- Different: presence/absence of fund-transfer request, verifiable reference number, isolation tactic
In the cited work, counterfactual augmentation reduced models' reliance on spurious correlations, and contrast sets exposed accuracy drops of up to about 25% in models that appeared to have "saturated" on standard held-out test sets.
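
Paired rows also allow a stricter metric than row-level accuracy: contrast consistency (Gardner et al.), the share of pairs in which every member is classified correctly. The sketch below assumes each record carries a `pair_id` linking a scam row to its legitimate twin and a `predicted` label from some model. Both fields are hypothetical, since the published schema lists no pair identifier.

```python
from collections import defaultdict

def contrast_consistency(records):
    """Fraction of pairs whose members are all classified correctly."""
    by_pair = defaultdict(list)
    for record in records:
        by_pair[record["pair_id"]].append(record["predicted"] == record["label"])
    return sum(all(results) for results in by_pair.values()) / len(by_pair)

def row_accuracy(records):
    """Plain per-row accuracy, for comparison."""
    return sum(r["predicted"] == r["label"] for r in records) / len(records)

# Toy predictions: the model over-flags the safe twin in pair p2.
records = [
    {"pair_id": "p1", "label": "scam", "predicted": "scam"},
    {"pair_id": "p1", "label": "safe", "predicted": "safe"},
    {"pair_id": "p2", "label": "scam", "predicted": "scam"},
    {"pair_id": "p2", "label": "safe", "predicted": "scam"},
]
print(row_accuracy(records), contrast_consistency(records))
```

On these toy records, three of four rows are correct but only one of two pairs is fully correct, which is the kind of gap the pairing is meant to surface.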
### DPO-ready preference pairs
70 rows include a `rejected_completion` — a **plausible-wrong** response (not obviously bad) following UltraFeedback methodology (Cui et al., 2023):
| Rejection type | Example |
|---|---|
| Hedged non-action | "This might be a scam but I'm not sure, maybe check later" |
| Missed isolation tactic | Identifies urgency but not the "don't tell family" red flag |
| Over-flagging safe row | Treats a real bank notification as suspicious |
| Generic without specifics | "Be careful with phone calls" (no hotline, no steps) |
The chosen-rejected gap is calibrated to be moderate (not cartoonish), following findings in the literature that overly large reward gaps cause overfitting.
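
Rows with a `rejected_completion` can be reshaped into preference triples. The sketch below makes two assumptions that the card does not confirm: that `text` serves as the prompt and that the hand-authored `completion` is the chosen response. The `prompt` / `chosen` / `rejected` key names follow the convention used by common DPO trainers, and `to_preference_pairs` is a hypothetical helper.

```python
def to_preference_pairs(rows):
    """Build prompt/chosen/rejected triples from rows that have a rejected completion."""
    pairs = []
    for row in rows:
        rejected = (row.get("rejected_completion") or "").strip()
        if not rejected:
            continue  # most rows carry no rejected completion
        pairs.append({
            "prompt": row["text"],          # assumption: raw message as the prompt
            "chosen": row["completion"],    # assumption: hand-authored response as chosen
            "rejected": rejected,
        })
    return pairs

# Toy rows for illustration; they are not taken from the dataset.
rows = [
    {
        "id": "cdef-001",
        "text": "This is Officer Chen from the IRS Criminal Investigation Division...",
        "completion": "This is a scam. Hang up and call the IRS directly to verify.",
        "rejected_completion": "Be careful with phone calls.",
    },
    {"id": "cdef-002", "text": "See you at 9am.", "completion": "Safe.", "rejected_completion": ""},
    {"id": "cdef-003", "text": "Hello.", "completion": "Safe.", "rejected_completion": None},
]
pairs = to_preference_pairs(rows)
print(len(pairs), pairs[0]["rejected"])
```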
### Register-gap disclosure
This dataset captures a specific speech register: **WeChat-vernacular Chinese** (traditional-character Cantonese and simplified-character Mandarin) as used by 60+ first-generation diaspora elders in informal family communication and as targeted by scammers. This register is not well-represented in any open corpus. The Southeast Asian coaching responses (Vietnamese, Khmer, Lao) are platform-generated and may not perfectly match local elder-communication registers.
### Per-strata zero-shot evaluation (holdout set)
Beyond the platform score, we ran zero-shot classification on 30 held-out rows using Claude Sonnet and GPT-4o with no metadata — just the raw `text` field — following HELM methodology (Liang et al., 2022):
**By language:**
| Language | Claude Sonnet | GPT-4o | Rows |
|---|---|---|---|
| Mandarin | 77% | 92% | 13 |
| Cantonese | 57% | 86% | 7 |
| English | 67% | 83% | 6 |
| Code-mixed | 50% | 50% | 4 |
**By label:**
| Label | Claude Sonnet | GPT-4o | Rows |
|---|---|---|---|
| scam | 100% | 92% | 12 |
| safe | 47% | 80% | 15 |
| ambiguous | 0% | 33% | 3 |
**By target demographic:**
| Demographic | Claude Sonnet | GPT-4o | Rows |
|---|---|---|---|
| elder_immigrant | 61% | 89% | 18 |
| adult_child_caregiver | 75% | 75% | 8 |
| cross_demo_edge | 75% | 75% | 4 |
**Key findings:**
- Both models achieve near-perfect scam detection but **systematically over-flag legitimate messages** (safe accuracy: 47-80%).
- **Code-mixed texts are hardest** for both models (50% accuracy).
- The **elder_immigrant demographic** shows the largest gap between the two models (61% vs 89%), suggesting that fine-tuning on this dataset would have the highest impact there.
- **Ambiguous rows expose model limitations** — neither model handles genuine ambiguity well.
Full results: `eval_results.json`. Script: `eval_harness.py`.
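
Per-stratum tables like the ones above come from grouping predictions by a metadata column and computing accuracy within each group. The following sketch illustrates that grouping on toy records. It is not the contents of `eval_harness.py`, and the `predicted` field and the helper `accuracy_by` are hypothetical.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Accuracy per distinct value of `key`, returned as {value: (accuracy, n_rows)}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        stratum = record[key]
        totals[stratum] += 1
        correct[stratum] += int(record["predicted"] == record["label"])
    return {stratum: (correct[stratum] / totals[stratum], totals[stratum]) for stratum in totals}

# Toy predictions for illustration; they are not the reported results.
records = [
    {"language": "mandarin", "label": "scam", "predicted": "scam"},
    {"language": "mandarin", "label": "safe", "predicted": "scam"},
    {"language": "cantonese", "label": "safe", "predicted": "safe"},
    {"language": "code_mixed", "label": "ambiguous", "predicted": "scam"},
]
print(accuracy_by(records, "language"))
print(accuracy_by(records, "label"))
```

Reporting the row count next to each accuracy matters here, since a stratum with only a handful of rows moves by many points per example.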
### Inter-rater reliability
Pre-committed threshold: Krippendorff's alpha >= 0.75 (pass), 0.67-0.74 (iterate), < 0.67 (re-author), following Krippendorff (2004) and Landis & Koch (1977).
Two annotators rated a subset on `label` (primary) and `cultural_lever` (secondary). The dataset card reports this openly — the small author pool is a limitation, partially offset by the automated audit framework catching systematic errors.
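
For two annotators and nominal labels with no missing ratings, Krippendorff's alpha reduces to a short computation over a coincidence matrix. The sketch below implements that special case and maps the result onto the pre-committed thresholds. The function names and the toy ratings are illustrative and are not the annotators' actual data.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(rater_a, rater_b):
    """Krippendorff's alpha for two raters, nominal data, no missing values."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of units")
    # Each unit contributes its value pair in both orders to the coincidence matrix.
    coincidences = Counter()
    for a, b in zip(rater_a, rater_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    n = 2 * len(rater_a)  # total number of pairable values
    marginals = Counter()
    for (c, _), count in coincidences.items():
        marginals[c] += count
    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(marginals[c] * marginals[k] for c, k in permutations(marginals, 2))
    if expected == 0:
        return 1.0  # only one category was ever used; treated as full agreement here
    return 1.0 - (n - 1) * observed / expected

def verdict(alpha):
    """Apply the thresholds above; values in [0.67, 0.75) are treated as 'iterate'."""
    if alpha >= 0.75:
        return "pass"
    if alpha >= 0.67:
        return "iterate"
    return "re-author"

# Toy ratings with one disagreement out of four units.
rater_a = ["scam", "scam", "safe", "safe"]
rater_b = ["scam", "safe", "safe", "safe"]
alpha = krippendorff_alpha_nominal(rater_a, rater_b)
print(round(alpha, 3), verdict(alpha))
```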
---
## 5. Uses
### Intended uses
- **Multilingual elder fraud detection.** Training/evaluating models that classify messages as scam/safe/ambiguous with coaching in Vietnamese, Khmer, Lao, Chinese, or English.
- **Cross-lingual safety deployment.** Same fraud patterns with responses in 5 languages — deploy one model across multiple Asian diaspora communities.
- **Culturally-aware LLM safety.** Teaching models that filial piety and authority deference are exploitation vectors, not just cultural features.
- **Consumer-protection chatbots.** Coaching adult children in their own language when a parent forwards a suspicious message.
- **DPO training.** 70 rows include `rejected_completion` for direct preference optimization.
- **Contrastive/counterfactual augmentation research.** Direct implementation of Kaushik et al. (ICLR 2020).
### Out-of-scope uses
- **Not a general fraud detector.** Covers Asian diaspora elder fraud patterns specifically.
- **Not a real-time scam filter.** Designed for human-readable coaching, not automated blocking.
- **Not for scam generation training.** Scripts are reconstructions from published law enforcement reports.
---
## 6. Distribution
- **License:** CC-BY-4.0
- **Format:** CSV (UTF-8)
- **Access:** Adaption Labs Uncharted Data Challenge + Hugging Face
### Artifacts included
| File | Description |
|---|---|
| `multilingual_elder_safety_msgs.csv` | 1,029-row adapted dataset (submission) |
| `holdout_rows.csv` | Held-out evaluation rows |
| `seed_rows_v3.csv` | 235-row core authored scenarios |
| `seed_rows_v4.csv` | 21 completion quality fixes |
| `training_rows.csv` | Training split input |
| `audit_log.csv` | Audit findings |
| `audit.py` | Automated audit script |
| `checklist_tests.csv` | Behavioral test suite (CheckList) |
| `eval_harness.py` | Zero-shot evaluation script |
| `fraud_typology.md` | Research grounding (FBI/IC3/SFPD) |
| `journal.md` | Full build journal |
| `README.md` | This dataset card |
| `LICENSE` | CC-BY-4.0 |
---
## 7. Maintenance
### Behavioral test suite
CheckList methodology (Ribeiro et al., ACL 2020):
- **MFT (minimum functionality test):** Basic scam/safe classification (7 tests)
- **INV (invariance test):** The same scam in different languages keeps the same classification
- **DIR (directional expectation test):** Adding verifiable phone numbers shifts scam → safe
### What we cut and why
| Cut | Reason |
|---|---|
| `preference_pairs` recipe | Does not exist in SDK 0.3.1 |
| `rejected_completion` on all rows | Scoped to 70 rows |
| Malay (Bahasa Malaysia) | Platform produced Vietnamese/Khmer/Lao; Malay deferred to v1.1 |
| Multi-turn dialogues | Reserved for v2 |
| Native-speaker validation for SEA | Planned for v1.1 release |
### Limitations
- **Small author pool.** Core scenarios authored by one person + friends/family. Cultural blind spots possible.
- **Synthetic only.** No real victim data.
- **SEA responses are platform-generated.** Vietnamese, Khmer, and Lao coaching responses were produced by the Adaption platform, not hand-authored by native speakers. May contain register errors or cultural inaccuracies.
- **Input scenarios are Chinese/English only.** The scam/safe messages themselves are not in Vietnamese/Khmer/Lao — only the coaching responses are. A fully localized dataset would have input messages in those languages too.
- **No Malay.** Platform produced Vietnamese, Khmer, and Lao but not Malay. Malaysian localization deferred.
- **Dialect boundaries.** Vietnamese covers only the standard northern dialect. Khmer covers only the standard Phnom Penh register.
- **No multi-turn conversations.** Single-message classification only.
- **Completion quality gap persists.** Message quality 78.4th percentile; completion quality 37.0th. Safe/negative_space responses are weakest.
### Ethical considerations
This dataset claims that specific cultural levers — authority deference, filial piety, fear of home-country authorities — are systematically exploited across Asian diaspora communities. It does not claim that Asian elders are uniquely gullible. The distinction matters (Blodgett et al., ACL 2020).
Scam scripts are reconstructions from published FBI/IC3/SFPD reports. Released under CC-BY-4.0 for research and protective use.
### Roadmap
- **v1.1:** Native-speaker validation for Vietnamese/Khmer/Lao. Add Malay localization. Input scenarios in SEA languages.
- **v2:** Chinese international student fraud (F-1/J-1 visa). Separate dataset.
- **v3:** Multi-turn dialogue sequences (pig-butchering grooming).
---
## Data Statement (Bender & Friedman v2)
### Language variety
- **Mandarin Chinese** (simplified): Mainland-origin diaspora, WeChat/SMS vernacular
- **Cantonese** (traditional): HK/Guangdong-origin diaspora
- **English**: US-born adult-child caregiver register
- **Vietnamese** (romanized): Standard northern dialect, formal/informal — platform-generated coaching
- **Khmer** (Khmer script): Standard register — platform-generated coaching
- **Lao** (Lao script): Standard register — platform-generated coaching
- **Code-mixed**: Cantonese-English, Mandarin-English diaspora switching
### Speaker demographics
All text is synthetic. Modeled speakers:
- **Scam perpetrators:** Impersonating officials, banks, family across PRC/HK/US institutions
- **Legitimate senders:** Real hospitals, banks, government agencies, family, community orgs
- **Target recipients:** First-generation Asian elders (60+), limited host-country language proficiency
- **Coaching voice:** Adult child or community worker explaining fraud/safety to elder or to caregiver sibling
### Annotator demographics
- **Lead author:** Vania, bilingual (English/Cantonese), proximity to Chinese-American community
- **Contributors:** Friends/family with Asian diaspora community proximity
- **SEA localization:** Adaption platform multilingual adaptation
---
## Citation
```bibtex
@misc{multilingual_elder_safety_msgs_2026,
title = {multilingual-elder-safety-msgs: A Multilingual Fraud-Recognition and
Safety Coaching Dataset for Asian Diaspora Elders},
author = {Vania},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/datasets/vanila434/multilingual-elder-safety-msgs}},
note = {Adaption Labs Uncharted Data Challenge 2026. 467 curated scenarios
expanded to 1,029 rows with coaching responses in Chinese, English,
Vietnamese, Khmer, and Lao. Contrastive pairs, cultural-lever annotations,
and platform-generated reasoning traces.}
}
```
---
## References
- Gebru, T. et al. (2021). *Datasheets for Datasets.* Communications of the ACM.
- Bender, E. M. & Friedman, B. (2018). *Data Statements for NLP.* TACL 6.
- Kaushik, D. et al. (2020). *Learning the Difference that Makes a Difference with Counterfactually-Augmented Data.* ICLR.
- Gardner, M. et al. (2020). *Evaluating Models' Local Decision Boundaries via Contrast Sets.* EMNLP.
- Ribeiro, M. T. et al. (2020). *Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.* ACL.
- Zhou, C. et al. (2023). *LIMA: Less Is More for Alignment.* NeurIPS.
- Mitra, A. et al. (2023). *Orca 2: Teaching Small Language Models How to Reason.*
- Mukherjee, S. et al. (2023). *Orca: Progressive Learning from Complex Explanation Traces of GPT-4.*
- Blodgett, S. L. et al. (2020). *Language (Technology) is Power.* ACL.
- FBI IC3 (2024). *Elder Fraud Report.* [ic3.gov](https://www.ic3.gov/)
- FBI IC3 PSA 190328. *Chinese Authority Impersonation Scams.*
- FBI IC3 PSA 251113. *Health Insurance Impersonation Targeting Chinese Speakers.*
- SFPD Bulletin 25-004. *Blessing Scam Cases, Chinatown.*
- UNODC (2024). *Casinos, Cyber Fraud, and Trafficking in Persons.* Southeast Asia regional report.
Provider:
vanila434



