Unlabelled Arabic Speech Dataset

Source: NIAID Data Ecosystem (indexed 2026-05-01)

Download link:

https://data.mendeley.com/datasets/yhy76b2c7b

Description:

Speech is central to communication, and its automated analysis matters increasingly in the era of artificial intelligence. Many fields, including security and forensics, rely on speech analysis, for example to identify speakers. Arabic poses a challenge, however: most speech analysis models are trained primarily on English data and perform poorly on Arabic. To address this, a large dataset of over a million Arabic samples was built for training models and improving their performance. The samples were collected from Arabic podcasts, so they are clear and noise-free. The audio recordings were extracted and split into one-second segments, which were then transformed into Mel-spectrograms using a 22 kHz sampling rate, a frame size of 2048 samples, and a hop length of 512 samples. Researchers working with Arabic speech data can use this dataset to experiment with different machine learning and deep learning models. To improve Arabic speaker identification, we trained a Siamese speech embedding model on this dataset and evaluated it on a benchmark dataset; the results are reported in the paper "Unsupervised Arabic Speech Embedding Model for Speaker Identification" (https://ieeexplore-ieee-org.aus.idm.oclc.org/document/10191576).
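The preprocessing parameters quoted in the description determine the shape of each spectrogram, and the arithmetic can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' pipeline. Several things are assumptions: "22 kHz" is read as exactly 22050 Hz, a trailing partial segment is dropped, and the "centered" frame count assumes the signal is padded by half a frame on each side before framing, a common default in audio libraries. The function names `split_into_segments` and `frame_count` are hypothetical.

```python
# Values taken from the dataset description, except SAMPLE_RATE:
# the description says "22 kHz"; 22050 Hz is an assumption.
SAMPLE_RATE = 22050
FRAME_SIZE = 2048      # STFT frame size in samples
HOP_LENGTH = 512       # hop between consecutive frames in samples
SEGMENT_SECONDS = 1    # length of each audio segment


def split_into_segments(samples, sample_rate=SAMPLE_RATE, seconds=SEGMENT_SECONDS):
    """Split a sequence of samples into non-overlapping fixed-length segments.

    A trailing partial segment is dropped. This is an assumption: the
    description does not say how leftover audio was handled.
    """
    segment_length = sample_rate * seconds
    full_segments = len(samples) // segment_length
    return [
        samples[i * segment_length:(i + 1) * segment_length]
        for i in range(full_segments)
    ]


def frame_count(n_samples, frame_size=FRAME_SIZE, hop_length=HOP_LENGTH, centered=True):
    """Return the number of STFT frames for a signal of n_samples.

    centered=True assumes frame_size // 2 samples of padding on each side,
    which gives 1 + n_samples // hop_length frames.
    centered=False counts only frames that fit entirely inside the signal.
    """
    if centered:
        return 1 + n_samples // hop_length
    if n_samples < frame_size:
        return 0
    return 1 + (n_samples - frame_size) // hop_length


if __name__ == "__main__":
    # Hypothetical recording: 3 seconds plus 1000 leftover samples of silence.
    recording = [0.0] * (SAMPLE_RATE * 3 + 1000)
    segments = split_into_segments(recording)
    print(len(segments), len(segments[0]))                 # 3 22050
    print(frame_count(len(segments[0])))                   # 44
    print(frame_count(len(segments[0]), centered=False))   # 40
```

Under these assumptions, each one-second segment yields 44 time frames with centered padding, or 40 without. The number of Mel bands is not stated in the description, so the vertical dimension of the spectrograms is left unspecified here.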

Created:

2023-09-18
