BAAD: A Multipurpose Dataset for Automatic Bangla Offensive Speech Recognition

Mendeley Data. Updated 2024-01-31; indexed 2024-06-26.

Download link:

https://data.mendeley.com/datasets/w24g8xn23c

Description:

This dataset contains audio recordings of common abusive Bengali words. It comprises 5277 audio clips covering 114 slang words, spoken by 60 native speakers in various dialects from over 20 districts. The audio was recorded natively by the participants in .WAV format.
• This dataset can be used to develop an automatic Bengali slang speech recognition system and as a benchmark for new ML models.
• It could help reduce the number of cyberbullying victims and limit children's exposure to abusive remarks in video and audio content. The offensive language in this dataset was carefully collected with that goal in mind.
• 65% of the participants were male and 35% were female.
• 10 university students participated in the evaluation of this dataset.
• The dataset can be further enriched. It contains some background noise, which can be kept to simulate a more realistic scenario or removed if not needed.
***Warning: this dataset contains audio content that may be disturbing or upsetting.***
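As a quick sanity check after downloading, the clip and word counts stated above can be compared against the files on disk. The sketch below is illustrative only. It assumes a hypothetical layout in which each word has its own folder of .WAV clips, which the listing does not confirm, and it uses Python's standard `wave` module, which reads uncompressed PCM files only. The function name `summarize_wav_clips` is made up for this example.

```python
import wave
from collections import Counter
from pathlib import Path


def summarize_wav_clips(root):
    """Count readable .wav clips per parent folder and sum their duration.

    Returns (clips_per_folder, total_seconds, unreadable_paths).
    """
    clips_per_folder = Counter()
    total_seconds = 0.0
    unreadable = []
    for path in sorted(Path(root).rglob("*")):
        # Match the extension case-insensitively (.wav or .WAV).
        if not path.is_file() or path.suffix.lower() != ".wav":
            continue
        try:
            with wave.open(str(path), "rb") as clip:
                total_seconds += clip.getnframes() / clip.getframerate()
        except (wave.Error, EOFError):
            # The wave module rejects compressed or malformed files.
            unreadable.append(path)
            continue
        # Assumption: the parent folder name is the word label.
        clips_per_folder[path.parent.name] += 1
    return clips_per_folder, total_seconds, unreadable
```

If the assumed layout holds, `len(clips_per_folder)` should come out at 114 and `sum(clips_per_folder.values())` at 5277, matching the description. A different layout, such as labels encoded in file names, would need a different grouping key.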

Created:

2024-01-31
