callmesan/ADIMA

Hugging Face, updated 2024-06-11, indexed 2024-06-29
Download link:
https://hf-mirror.com/datasets/callmesan/ADIMA
Resource description:
---
dataset_info:
  features:
  - name: path_to_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: abuse
    dtype: string
  - name: language
    dtype: string
  splits:
  - name: train
    num_bytes: 5192187842.424
    num_examples: 8128
  - name: test
    num_bytes: 2347579907.564
    num_examples: 3647
  download_size: 5849025117
  dataset_size: 7539767749.988
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
task_categories:
- audio-classification
language:
- ta
- bn
- gu
- hi
- or
- ml
- kn
- pa
tags:
- audio-abuse
size_categories:
- 10K<n<100K
---

ADIMA is a dataset by [ShareChat Inc](http://sharechat.com/research/adima). Dataset statistics and other information are in the [paper](https://ieeexplore.ieee.org/abstract/document/9746718).

I am in no way affiliated with ShareChat. Just helping other users in Open Science. Cite them with the following:

```
@INPROCEEDINGS{9746718,
  author={Gupta, Vikram and Sharon, Rini and Sawhney, Ramit and Mukherjee, Debdoot},
  booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={ADIMA: Abuse Detection In Multilingual Audio},
  year={2022},
  volume={},
  number={},
  pages={6172-6176},
  abstract={Abusive content detection in spoken text can be addressed by performing Automatic Speech Recognition (ASR) and leveraging advancements in natural language processing. However, ASR models introduce latency and often perform sub-optimally for abusive words as they are underrepresented in training corpora and not spoken clearly or completely. Exploration of this problem entirely in the audio domain has largely been limited by the lack of audio datasets. Building on these challenges, we propose ADIMA, a novel, linguistically diverse, ethically sourced, expert annotated and well-balanced multilingual abuse detection audio dataset comprising of 11,775 audio samples in 10 Indic languages spanning 65 hours and spoken by 6,446 unique users. Through quantitative experiments across monolingual and cross-lingual zeroshot settings, we take the first step in democratizing audio based content moderation in Indic languages and set forth our dataset to pave future work. Dataset and code are available at: https://github.com/ShareChatAI/Adima},
  keywords={Training;Ethics;Codes;Speech coding;Conferences;Buildings;Signal processing;Abusive Content Detection;Multilingual Audio Analysis;Indic Dataset;Crosslingual Audio Analysis},
  doi={10.1109/ICASSP43922.2022.9746718},
  ISSN={2379-190X},
  month={May},
}
```
Provider:
callmesan
Summary of original information

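ADIMA dataset overview

Dataset information

Features

  • path_to_audio: audio feature (the clip itself), with a sampling rate of 16000 Hz.
  • abuse: string, indicating whether the clip contains abusive content.
  • language: string, giving the language of the audio.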
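
The 16000 Hz sampling rate fixes the relation between sample counts and clip length. The sketch below is illustrative only: it converts a sample count to seconds and derives an average clip length from the totals quoted in the paper abstract (65 hours, 11,775 samples). The function name `clip_duration_seconds` is hypothetical, and `num_samples` is assumed to count samples per channel.

```python
SAMPLING_RATE = 16_000  # Hz, from the path_to_audio feature on the card


def clip_duration_seconds(num_samples: int, sampling_rate: int = SAMPLING_RATE) -> float:
    """Duration of a clip given its per-channel sample count."""
    return num_samples / sampling_rate


# Totals quoted in the paper abstract: 65 hours across 11,775 samples.
TOTAL_HOURS = 65
TOTAL_CLIPS = 11_775

# Average clip length in seconds, and the matching sample count at 16 kHz.
avg_seconds = TOTAL_HOURS * 3600 / TOTAL_CLIPS
avg_samples = round(avg_seconds * SAMPLING_RATE)

print(f"160000 samples -> {clip_duration_seconds(160_000):.1f} s")
print(f"average clip: {avg_seconds:.1f} s, about {avg_samples} samples")
```

The average is a back-of-the-envelope figure from the abstract's rounded totals, not a statistic reported on the card.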

Data splits

  • train: training set with 8128 examples, 5192187842.424 bytes.
  • test: test set with 3647 examples, 2347579907.564 bytes.
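
As a quick consistency check, the split counts can be summed and compared with the 11,775 samples quoted in the paper abstract and with the 10K<n<100K size category. This is a minimal sketch that uses only the numbers on the card; the variable names are illustrative.

```python
# Example counts per split, copied from the dataset card.
splits = {"train": 8128, "test": 3647}

total = sum(splits.values())
fractions = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} examples ({fractions[name]:.1%} of the data)")

# The abstract quotes 11,775 audio samples; the card's size category is 10K<n<100K.
print("matches abstract total:", total == 11_775)
print("within size category:", 10_000 < total < 100_000)
```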

Data size

  • Download size: 5849025117 bytes.
  • Total dataset size: 7539767749.988 bytes.
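
The byte counts are easier to read in GiB, and the per-split sizes should add up to the total dataset size. The following sketch checks that, assuming 1 GiB = 1024^3 bytes; the fractional byte values are kept exactly as they appear on the card.

```python
GIB = 1024 ** 3  # bytes per gibibyte

# Sizes in bytes, copied from the dataset card.
train_bytes = 5192187842.424
test_bytes = 2347579907.564
download_size = 5849025117
dataset_size = 7539767749.988

# The two split sizes should sum to the reported dataset size.
split_total = train_bytes + test_bytes
print("splits sum to dataset size:", abs(split_total - dataset_size) < 1.0)

print(f"download size: {download_size / GIB:.2f} GiB")
print(f"dataset size:  {dataset_size / GIB:.2f} GiB")
print(f"download / dataset ratio: {download_size / dataset_size:.2f}")
```

Plan for roughly the download size on the network and the larger dataset size on disk once the data is prepared locally.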

Configuration

  • default:
    • train: data files matching data/train-*
    • test: data files matching data/test-*
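
The `data_files` entries are glob patterns that assign files to splits. The sketch below approximates that mapping with Python's `fnmatch`; it is not how the Hugging Face loader resolves patterns internally, and the shard file names are hypothetical, made up for illustration because the card does not list the actual files.

```python
from fnmatch import fnmatch
from typing import Optional

# Split name -> glob pattern, copied from the default config on the card.
data_files = {"train": "data/train-*", "test": "data/test-*"}


def split_for(path: str) -> Optional[str]:
    """Return the split whose pattern matches the path, or None if none does."""
    for split, pattern in data_files.items():
        if fnmatch(path, pattern):
            return split
    return None


# Hypothetical file names; the real shard names are not given on the card.
candidates = [
    "data/train-00000-of-00002.parquet",
    "data/test-00000-of-00001.parquet",
    "README.md",
]

for path in candidates:
    print(path, "->", split_for(path))
```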

Task categories

  • audio-classification: audio classification task.

Languages

  • ta: Tamil
  • bn: Bengali
  • gu: Gujarati
  • hi: Hindi
  • or: Odia
  • ml: Malayalam
  • kn: Kannada
  • pa: Punjabi
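
A small lookup table makes the language codes readable in reports. This sketch covers only the eight codes in the card metadata (the paper abstract mentions 10 Indic languages). Whether the `language` column stores these codes or some other form is not stated on the card, so treat the lookup as an assumption; `language_name` is a hypothetical helper.

```python
# Language codes listed in the card metadata, with their English names.
LANGUAGES = {
    "ta": "Tamil",
    "bn": "Bengali",
    "gu": "Gujarati",
    "hi": "Hindi",
    "or": "Odia",
    "ml": "Malayalam",
    "kn": "Kannada",
    "pa": "Punjabi",
}


def language_name(code: str) -> str:
    """Map a language code to its English name; unknown codes pass through unchanged."""
    return LANGUAGES.get(code.strip().lower(), code)


for code in ("hi", "TA", "xx"):
    print(code, "->", language_name(code))
```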

Tags

  • audio-abuse: abuse detection in audio.

Dataset scale

  • 10K<n<100K: the dataset contains between 10,000 and 100,000 samples.