callmesan/ADIMA
Source: Hugging Face. Updated 2024-06-11; indexed 2024-06-29.
Download link:
https://hf-mirror.com/datasets/callmesan/ADIMA
Resource description:
---
dataset_info:
  features:
  - name: path_to_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: abuse
    dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_bytes: 5192187842.424
    num_examples: 8128
  - name: test
    num_bytes: 2347579907.564
    num_examples: 3647
  download_size: 5849025117
  dataset_size: 7539767749.988
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- audio-classification
language:
- ta
- bn
- gu
- hi
- or
- ml
- kn
- pa
tags:
- audio-abuse
size_categories:
- 10K<n<100K
---
ADIMA is a dataset by [ShareChat Inc](http://sharechat.com/research/adima). Dataset statistics and other information are in the [paper](https://ieeexplore.ieee.org/abstract/document/9746718).
I am in no way affiliated with ShareChat. Just helping other users in Open Science.
Cite them with the following:
```
@INPROCEEDINGS{9746718,
author={Gupta, Vikram and Sharon, Rini and Sawhney, Ramit and Mukherjee, Debdoot},
booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={ADIMA: Abuse Detection In Multilingual Audio},
year={2022},
volume={},
number={},
pages={6172-6176},
abstract={Abusive content detection in spoken text can be addressed by performing Automatic Speech Recognition (ASR) and leveraging advancements in natural language processing. However, ASR models introduce latency and often perform sub-optimally for abusive words as they are underrepresented in training corpora and not spoken clearly or completely. Exploration of this problem entirely in the audio domain has largely been limited by the lack of audio datasets. Building on these challenges, we propose ADIMA, a novel, linguistically diverse, ethically sourced, expert annotated and well-balanced multilingual abuse detection audio dataset comprising of 11,775 audio samples in 10 Indic languages spanning 65 hours and spoken by 6,446 unique users. Through quantitative experiments across monolingual and cross-lingual zeroshot settings, we take the first step in democratizing audio based content moderation in Indic languages and set forth our dataset to pave future work. Dataset and code are available at: https://github.com/ShareChatAI/Adima},
keywords={Training;Ethics;Codes;Speech coding;Conferences;Buildings;Signal processing;Abusive Content Detection;Multilingual Audio Analysis;Indic Dataset;Crosslingual Audio Analysis},
doi={10.1109/ICASSP43922.2022.9746718},
ISSN={2379-190X},
month={May},}
```
Provider:
callmesan
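Summary of original information
ADIMA dataset overview
Dataset information
Features
- path_to_audio: the audio clip (declared with the audio dtype), sampling rate 16000 Hz.
- abuse: string; indicates whether the clip contains abusive content.
- language: string; the language of the audio.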
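The three columns above can be read with the Hugging Face `datasets` library. The following is a minimal sketch, not the provider's own loading code. It assumes the repository id `callmesan/ADIMA` (taken from the download link) is reachable on the Hub and that `datasets` is installed. The helper names `stream_adima` and `describe` are hypothetical. Streaming is used so that the full download is not required, and audio decoding is switched off so that no audio backend is needed.

```python
REPO_ID = "callmesan/ADIMA"  # assumption: repository id taken from the download link


def stream_adima(split="train"):
    """Return a streaming view of one split with audio left undecoded."""
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import Audio, load_dataset

    ds = load_dataset(REPO_ID, split=split, streaming=True)
    # decode=False yields {"path": ..., "bytes": ...} instead of a waveform.
    return ds.cast_column("path_to_audio", Audio(decode=False))


def describe(example):
    """Summarise one record using the three columns listed in the card."""
    audio = example["path_to_audio"]
    raw = audio.get("bytes")
    return {
        "language": example["language"],
        "abuse": example["abuse"],
        "audio_bytes": len(raw) if raw else None,
    }
```

Calling `describe(next(iter(stream_adima("train"))))` would fetch only the first record. The card does not list the string values used in the `abuse` column, so inspect a few records before mapping them to class ids.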
Data splits
- train: training split, 8128 examples, 5192187842.424 bytes.
- test: test split, 3647 examples, 2347579907.564 bytes.
Data size
- Download size: 5849025117 bytes.
- Total dataset size: 7539767749.988 bytes.
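As a quick consistency check on the figures above, the sketch below sums the split sizes and derives the train/test ratio. It uses only numbers copied from the card metadata; the variable names are illustrative.

```python
# Figures copied from the card metadata.
SPLITS = {
    "train": {"num_examples": 8128, "num_bytes": 5192187842.424},
    "test": {"num_examples": 3647, "num_bytes": 2347579907.564},
}
DOWNLOAD_SIZE = 5849025117
DATASET_SIZE = 7539767749.988

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())

# The per-split byte counts should add up to the reported dataset size.
assert abs(total_bytes - DATASET_SIZE) < 1.0

train_fraction = SPLITS["train"]["num_examples"] / total_examples
GIB = 1024 ** 3

print(f"total examples: {total_examples}")  # 11775, matching the paper abstract
print(f"train fraction: {train_fraction:.1%}")
print(f"download: {DOWNLOAD_SIZE / GIB:.2f} GiB, dataset: {DATASET_SIZE / GIB:.2f} GiB")
```

The total of 11775 examples agrees with the 11,775 audio samples reported in the abstract, and it falls in the `10K<n<100K` size category.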
Configurations
- default:
  - train: data files at data/train-*.
  - test: data files at data/test-*.
Task categories
- audio-classification: audio classification task.
Languages
- ta: Tamil
- bn: Bengali
- gu: Gujarati
- hi: Hindi
- or: Odia (Oriya)
- ml: Malayalam
- kn: Kannada
- pa: Punjabi
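The language codes above can be turned into a lookup table for per-language reporting. This is a small sketch under one assumption: the card does not say whether the `language` column stores these codes or full names, so values not found in the table are passed through unchanged. The helper `count_by_language` and the toy records are hypothetical and not drawn from the dataset.

```python
from collections import Counter

# Codes from the card's `language` tag, with their English names.
LANGUAGE_NAMES = {
    "ta": "Tamil",
    "bn": "Bengali",
    "gu": "Gujarati",
    "hi": "Hindi",
    "or": "Odia",
    "ml": "Malayalam",
    "kn": "Kannada",
    "pa": "Punjabi",
}


def count_by_language(records):
    """Count records per language, using readable names where known."""
    counts = Counter(r["language"] for r in records)
    return {LANGUAGE_NAMES.get(code, code): n for code, n in counts.items()}


# Toy records for illustration only; not real dataset rows.
toy = [{"language": "hi"}, {"language": "ta"}, {"language": "hi"}]
print(count_by_language(toy))  # {'Hindi': 2, 'Tamil': 1}
```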
Tags
- audio-abuse: abuse detection in audio.
Dataset size category
- 10K<n<100K: between 10,000 and 100,000 examples.



