vikrant-vikram/INDICA
Source: Hugging Face (updated 2026-04-10, indexed 2026-04-12)
Download link:
https://hf-mirror.com/datasets/vikrant-vikram/INDICA
# INDICA: An Audio Indic-Language Telecom Fraud Analysis Benchmark
<p align="center">
<b>Multilingual | Audio + Text | Benchmark for Fraud Detection</b>
</p>
<p align="center">
<img src="https://img.shields.io/badge/Dataset-189K%2B%20Samples-blue">
<img src="https://img.shields.io/badge/Languages-10-green">
<img src="https://img.shields.io/badge/Tasks-3-orange">
<img src="https://img.shields.io/badge/Modalities-Audio%20%2B%20Text-purple">
</p>
---
## Overview
INDICA is a comprehensive benchmark for telecom fraud call analysis in Indic languages.
It is built on the IndiF dataset, the first large-scale multilingual dataset for fraud detection in telecom conversations.
This benchmark enables research in:
- Scenario Classification
- Fraud Call Detection
- Fraud-Type Classification
INDICA addresses key challenges:
- Lack of multilingual datasets
- Limited reproducibility
- Poor cross-lingual generalization
---
## Key Contributions
- **IndiF Dataset**
  - 189,420 samples across 10 Indic languages
  - Audio + text modalities
  - Fine-grained annotations
- **INDICA Benchmark**
  - Evaluation across speech, text, and multimodal models
  - Standardized experimental setup
- **Multilingual Setup**
  - Enables robust cross-lingual evaluation
---
## Dataset: IndiF
### Languages
Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Odia, Tamil, Telugu
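
The Key Contributions mention cross-lingual evaluation. One common way to set that up over the ten languages listed here is a leave-one-language-out protocol: train on nine languages and test on the held-out one. This is only an illustrative sketch; the card does not state which protocol the benchmark actually uses, and the function name `leave_one_language_out` is hypothetical.

```python
# Languages as listed in this card.
LANGUAGES = [
    "Assamese", "Bengali", "English", "Gujarati", "Hindi",
    "Kannada", "Malayalam", "Odia", "Tamil", "Telugu",
]


def leave_one_language_out(languages):
    """Yield (train_languages, held_out_language) pairs, one per language."""
    for held_out in languages:
        train = [lang for lang in languages if lang != held_out]
        yield train, held_out


# Assumed protocol for illustration only, not the benchmark's documented setup.
splits = list(leave_one_language_out(LANGUAGES))
for train, held_out in splits:
    print(f"test on {held_out}: train on {len(train)} languages")
```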
### Dataset Statistics
| Property | Value |
|---------|------|
| Total Samples | 189,420 |
| Languages | 10 |
| Samples per Language | 18,942 |
| Modalities | Audio + Text |
| Tasks | 3 |
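
The per-language figure is consistent with an even split of the total across the ten languages. A quick arithmetic check, using only the numbers from the table above:

```python
# Figures copied from the Dataset Statistics table.
TOTAL_SAMPLES = 189_420
NUM_LANGUAGES = 10

# An even split across languages should leave no remainder.
samples_per_language, remainder = divmod(TOTAL_SAMPLES, NUM_LANGUAGES)
print(samples_per_language, remainder)
```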
### Annotations
- Scenario Classification (7 categories)
- Fraud vs Non-Fraud (Binary)
- Fraud-Type Classification (7 types)
---
## Tasks
### Scenario Classification
- Customer Service
- Delivery
- Ride-hailing
- Retail Transactions
- Appointment Scheduling
- Food Ordering
- Traffic Information
### Fraud Detection
- Fraud
- Non-Fraud
### Fraud-Type Classification
- Banking Fraud
- Customer Service Impersonation
- Investment Scam
- Phishing
- Lottery Scam
- Kidnapping
- Identity Theft
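
For training classifiers, the three label sets above can be turned into integer mappings. The sketch below is a minimal example: the class names come from this card, but the integer ids, the ordering, and the helper `make_label_maps` are assumptions made for illustration. The released annotation files may encode labels differently.

```python
# Class names as listed in this card; the integer ids below are hypothetical.
SCENARIOS = [
    "Customer Service", "Delivery", "Ride-hailing", "Retail Transactions",
    "Appointment Scheduling", "Food Ordering", "Traffic Information",
]
FRAUD_LABELS = ["Non-Fraud", "Fraud"]
FRAUD_TYPES = [
    "Banking Fraud", "Customer Service Impersonation", "Investment Scam",
    "Phishing", "Lottery Scam", "Kidnapping", "Identity Theft",
]


def make_label_maps(names):
    """Return (label2id, id2label) dictionaries for a list of class names."""
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


scenario2id, id2scenario = make_label_maps(SCENARIOS)
fraud2id, id2fraud = make_label_maps(FRAUD_LABELS)
type2id, id2type = make_label_maps(FRAUD_TYPES)
print(len(scenario2id), len(fraud2id), len(type2id))
```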
---
## Dataset Files
- Gujarati_audio.tar.gz — Gujarati audio
- Hindi_audio.tar.gz — Hindi audio
- Tamil_audio.tar.gz — Tamil audio
- Text_samples.tar.gz — Text data
- .......
---
## How to Use
### Download
```bash
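# --repo-type dataset: INDICA is a dataset repo (the CLI defaults to model repos).
# --local-dir .: save the archives in the current directory instead of the cache.
hf download vikrant-vikram/INDICA Gujarati_audio.tar.gz --repo-type dataset --local-dir .
hf download vikrant-vikram/INDICA Hindi_audio.tar.gz --repo-type dataset --local-dir .
hf download vikrant-vikram/INDICA Tamil_audio.tar.gz --repo-type dataset --local-dir .
hf download vikrant-vikram/INDICA Text_samples.tar.gz --repo-type dataset --local-dir .
# ... repeat for the remaining archives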
```
### Extract
```bash
tar -xzf Gujarati_audio.tar.gz
tar -xzf Hindi_audio.tar.gz
tar -xzf Tamil_audio.tar.gz
tar -xzf Text_samples.tar.gz
# ... repeat for the remaining archives
```
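
The download and extraction steps can also be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed and uses its `hf_hub_download` function with `repo_type="dataset"`. The helper names `fetch_archive` and `extract_archive` are hypothetical, and only archive names listed in this card should be passed in.

```python
import tarfile
from pathlib import Path

REPO_ID = "vikrant-vikram/INDICA"


def fetch_archive(filename, local_dir="."):
    """Download one archive from the dataset repo and return its local path."""
    # Imported lazily so extract_archive works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=filename,
        repo_type="dataset",  # INDICA is hosted as a dataset repository
        local_dir=local_dir,
    )


def extract_archive(archive_path, dest="."):
    """Extract a .tar.gz archive into dest; return sorted names of entries in dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # The "data" filter rejects unsafe members; older Python versions lack it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest, **kwargs)
    return sorted(p.name for p in dest.iterdir())


# Usage (requires network access):
#   path = fetch_archive("Text_samples.tar.gz")
#   print(extract_archive(path))
```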
### Structure
```
Audio_samples/
├── Gujarati/
├── Hindi/
├── Tamil/
....
Text_samples/
├── Gujarati/
├── Hindi/
├── Tamil/
....
```
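
After extraction, a quick way to confirm the layout is to count the files under each language folder. The card does not document file names or extensions inside those folders, so this sketch simply counts every file; the helper `count_files_per_language` is hypothetical.

```python
from pathlib import Path


def count_files_per_language(root):
    """Map each language subdirectory of root to its number of files (recursive)."""
    root = Path(root)
    counts = {}
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[lang_dir.name] = sum(1 for f in lang_dir.rglob("*") if f.is_file())
    return counts


# Usage, assuming the archives were extracted into the current directory:
#   print(count_files_per_language("Audio_samples"))
#   print(count_files_per_language("Text_samples"))
```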
---
## Authors
- Nitin Choudhury
- Samyuktha Chilaka
- Bikrant Bikram Pratap Maurya
- Arun Balaji Buduru
---
## Citation
```bibtex
@inproceedings{indica2026,
title={INDICA: An Audio Indic-Language Telecom Fraud Analysis Benchmark},
author={Choudhury, Nitin and Chilaka, Samyuktha and Maurya, Bikrant Bikram Pratap and Buduru, Arun Balaji},
year={2026}
}
```
Provider:
vikrant-vikram



