Arabic Scam and Legitimate Call Conversation Dataset (ASLC-448)
Source: DataCite Commons. Updated: 2026-04-13. Indexed: 2026-05-04.
Download link:
https://data.mendeley.com/datasets/p384bgyzz3
Resource description:
## Description
This dataset presents a novel multi-dialect Arabic scam and legitimate telephone call conversation corpus designed for training and evaluating scam detection models. The dataset addresses a critical gap in Arabic-language fraud detection research, where no publicly available scam call datasets currently exist.
The dataset contains 448 annotated conversations covering nine Arabic dialects: Modern Standard Arabic (MSA), Egyptian, Gulf, Jordanian, Saudi, Yemeni, Sudanese, Iraqi, and Syrian. Each conversation simulates a realistic telephone interaction as a five-turn dialogue between a caller and a receiver (three caller turns alternating with two receiver turns).
## Data Structure
The Excel file contains 18 columns per conversation:
| Column | Type | Description |
|--------|------|-------------|
| conversation_id | String | Unique identifier (CONV_0001 to CONV_0448) |
| full_conversation | String | Complete conversation text with speaker labels |
| caller_turn_1 | String | First caller utterance |
| receiver_turn_1 | String | First receiver response |
| caller_turn_2 | String | Second caller utterance |
| receiver_turn_2 | String | Second receiver response |
| caller_turn_3 | String | Third caller utterance |
| label | String | Binary class label: scam or not_scam |
| category | String | Fine-grained category (23 categories) |
| dialect | String | Arabic dialect (9 dialects) |
| urgency_score | Integer | Time pressure intensity (0–5) |
| sensitive_info_requests | Integer | Confidential data solicitation (0–2) |
| financial_pressure_score | Integer | Monetary demands intensity (0–5) |
| threat_score | Integer | Threat/intimidation level (0–3) |
| impersonation_score | Integer | Identity deception level (0–2) |
| conversation_length | Integer | Total characters in conversation |
| word_count | Integer | Total words in conversation |
| label_binary | Integer | Binary encoding: 1 = scam, 0 = not_scam |
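
The documented ranges and encodings in the table above lend themselves to a quick sanity check before training. The following is a minimal sketch, not part of the dataset's own tooling: it assumes each conversation has already been loaded as a plain dictionary keyed by the column names above (for example from the Excel file), and the helper names `validate_row` and `ordered_turns` are hypothetical.

```python
import re
from typing import Any, Dict, List, Tuple

# Inclusive score ranges copied from the column table.
SCORE_RANGES: Dict[str, Tuple[int, int]] = {
    "urgency_score": (0, 5),
    "sensitive_info_requests": (0, 2),
    "financial_pressure_score": (0, 5),
    "threat_score": (0, 3),
    "impersonation_score": (0, 2),
}

# Turn columns in spoken order: caller, receiver, caller, receiver, caller.
TURN_COLUMNS: List[Tuple[str, str]] = [
    ("caller", "caller_turn_1"),
    ("receiver", "receiver_turn_1"),
    ("caller", "caller_turn_2"),
    ("receiver", "receiver_turn_2"),
    ("caller", "caller_turn_3"),
]

ID_PATTERN = re.compile(r"^CONV_(\d{4})$")


def validate_row(row: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one row; an empty list means it passed."""
    problems: List[str] = []

    # conversation_id should look like CONV_0001 ... CONV_0448.
    match = ID_PATTERN.match(str(row.get("conversation_id", "")))
    if match is None or not 1 <= int(match.group(1)) <= 448:
        problems.append(f"bad conversation_id: {row.get('conversation_id')!r}")

    # label and label_binary must agree (1 = scam, 0 = not_scam).
    label = row.get("label")
    if label not in ("scam", "not_scam"):
        problems.append(f"unexpected label: {label!r}")
    elif row.get("label_binary") != (1 if label == "scam" else 0):
        problems.append("label_binary does not match label")

    # Each risk score must be an integer inside its documented range.
    for column, (low, high) in SCORE_RANGES.items():
        try:
            value = int(row.get(column))
        except (TypeError, ValueError):
            problems.append(f"{column} is missing or not an integer")
            continue
        if not low <= value <= high:
            problems.append(f"{column}={value} outside {low}-{high}")

    return problems


def ordered_turns(row: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Rebuild the five-turn dialogue as (speaker, text) pairs in spoken order."""
    return [(speaker, str(row[column])) for speaker, column in TURN_COLUMNS]
```

Calling `validate_row` on every row and collecting the non-empty results gives a quick report of rows that deviate from the documented schema; `ordered_turns` is convenient when a model expects speaker-tagged turns rather than the single `full_conversation` string.
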

## Files
| File | Description |
|------|-------------|
| arabic_scam_dataset_complete.xlsx | Complete text dataset with 448 conversations, labels, categories, dialects, risk scores, and metadata (18 columns) |
| audio_dataset/scam/*.wav | Synthesized audio files for scam conversations (16 kHz, mono, WAV) |
| audio_dataset/not_scam/*.wav | Synthesized audio files for legitimate conversations (16 kHz, mono, WAV) |
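
Since the audio files are described as 16 kHz mono WAV and are split into `scam` and `not_scam` folders, it can be useful to confirm both properties before feature extraction. The sketch below uses only Python's standard `wave` module and assumes the files are uncompressed PCM WAV that this module can read; the function name `describe_wav` is hypothetical, and the label is inferred purely from the parent folder name.

```python
import wave
from pathlib import Path
from typing import Dict, Union

EXPECTED_SAMPLE_RATE = 16_000  # Hz, as stated in the file table
EXPECTED_CHANNELS = 1          # mono, as stated in the file table


def describe_wav(path: Union[str, Path]) -> Dict[str, object]:
    """Read a WAV header and compare it with the documented audio format."""
    path = Path(path)
    with wave.open(str(path), "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        # Folder name is assumed to be "scam" or "not_scam".
        "label": path.parent.name,
        "sample_rate": sample_rate,
        "channels": channels,
        "duration_s": frames / sample_rate if sample_rate else 0.0,
        "matches_spec": (
            sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS
        ),
    }
```

Running this over `Path("audio_dataset").glob("*/*.wav")` and filtering on `matches_spec` would flag any file that needs resampling or downmixing before use.
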
---
## License
CC BY 4.0 (Creative Commons Attribution 4.0 International)
Provider:
Mendeley Data
Created:
2026-02-16



