
vikrant-vikram/INDICA

Source: Hugging Face | Updated: 2026-04-10 | Indexed: 2026-04-12
Download link:
https://hf-mirror.com/datasets/vikrant-vikram/INDICA
Description:
# INDICA: An Audio Indic-Language Telecom Fraud Analysis Benchmark

<p align="center">
<b>Multilingual | Audio + Text | Benchmark for Fraud Detection</b>
</p>

<p align="center">
<img src="https://img.shields.io/badge/Dataset-189K%2B%20Samples-blue">
<img src="https://img.shields.io/badge/Languages-10-green">
<img src="https://img.shields.io/badge/Tasks-3-orange">
<img src="https://img.shields.io/badge/Modalities-Audio%20%2B%20Text-purple">
</p>

---

## Overview

INDICA is a comprehensive benchmark for telecom fraud call analysis in Indic languages. It is built on the IndiF dataset, the first large-scale multilingual dataset for fraud detection in telecom conversations.

This benchmark enables research in:

- Scenario Classification
- Fraud Call Detection
- Fraud-Type Classification

INDICA addresses key challenges:

- Lack of multilingual datasets
- Limited reproducibility
- Poor cross-lingual generalization

---

## Key Contributions

- **IndiF Dataset**
  - 189,420 samples across 10 Indic languages
  - Audio + text modalities
  - Fine-grained annotations
- **INDICA Benchmark**
  - Evaluation across speech, text, and multimodal models
  - Standardized experimental setup
- **Multilingual Setup**
  - Enables robust cross-lingual evaluation

---

## Dataset: IndiF

### Languages

Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Odia, Tamil, Telugu

### Dataset Statistics

| Property | Value |
|---------|------|
| Total Samples | 189,420 |
| Languages | 10 |
| Samples per Language | 18,942 |
| Modalities | Audio + Text |
| Tasks | 3 |

### Annotations

- Scenario Classification (7 categories)
- Fraud vs Non-Fraud (Binary)
- Fraud-Type Classification (7 types)

---

## Tasks

### Scenario Classification

- Customer Service
- Delivery
- Ride-hailing
- Retail Transactions
- Appointment Scheduling
- Food Ordering
- Traffic Information

### Fraud Detection

```
Fraud vs Non-Fraud
```

### Fraud-Type Classification

- Banking Fraud
- Customer Service Impersonation
- Investment Scam
- Phishing
- Lottery Scam
- Kidnapping
- Identity Theft

---

## Dataset Files

- Gujarati_audio.tar.gz: Gujarati audio
- Hindi_audio.tar.gz: Hindi audio
- Tamil_audio.tar.gz: Tamil audio
- Text_samples.tar.gz: Text data
- .......

---

## How to Use

### Download

```bash
hf download vikrant-vikram/INDICA Gujarati_audio.tar.gz
hf download vikrant-vikram/INDICA Hindi_audio.tar.gz
hf download vikrant-vikram/INDICA Tamil_audio.tar.gz
hf download vikrant-vikram/INDICA Text_samples.tar.gz
......
```

### Extract

```bash
tar -xzf Gujarati_audio.tar.gz
tar -xzf Hindi_audio.tar.gz
tar -xzf Tamil_audio.tar.gz
tar -xzf Text_samples.tar.gz
....
```

### Structure

```
Audio_samples/
├── Gujarati/
├── Hindi/
├── Tamil/
....
Text_samples/
├── Gujarati/
├── Hindi/
├── Tamil/
....
```

---

## Authors

- Nitin Choudhury
- Samyuktha Chilaka
- Bikrant Bikram Pratap Maurya
- Arun Balaji Buduru

---

## Citation

```bibtex
@inproceedings{indica2026,
  title={INDICA: An Audio Indic-Language Telecom Fraud Analysis Benchmark},
  author={Choudhury, Nitin and Chilaka, Samyuktha and Maurya, Bikrant and Buduru, Arun},
  year={2026}
}
```
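
### Example: extracting and inspecting the archives from Python

The extract step and the directory structure described under "How to Use" can also be handled from Python. The following is a minimal sketch, not part of the official tooling. It assumes the downloaded `*.tar.gz` files sit together in one folder and that they unpack into the `Audio_samples/` and `Text_samples/` layout listed in this README. The function names `extract_archives` and `list_languages` are hypothetical helpers introduced here for illustration, and nothing is assumed about the files inside each language folder.

```python
import tarfile
from pathlib import Path


def extract_archives(archive_dir, out_dir):
    """Extract every *.tar.gz found in archive_dir into out_dir.

    Returns the names of the archives that were extracted, sorted.
    Only use this on archives from a source you trust.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(out)
        extracted.append(archive.name)
    return extracted


def list_languages(root):
    """Map each top-level folder under root to its sorted subfolder names.

    With the layout described in the README, the keys would be
    Audio_samples and Text_samples, and the values the language folders.
    """
    layout = {}
    for modality_dir in sorted(Path(root).iterdir()):
        if modality_dir.is_dir():
            layout[modality_dir.name] = sorted(
                p.name for p in modality_dir.iterdir() if p.is_dir()
            )
    return layout
```

Calling `extract_archives("downloads", "INDICA")` followed by `list_languages("INDICA")` (both paths are placeholders) gives a quick check that each language has both an audio and a text folder before any training or evaluation code is run.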
Provided by:
vikrant-vikram