Arabic Audio Text Dataset Repository

Source: DataCite Commons | Updated: 2025-09-14 | Indexed: 2026-05-03
Download link:
https://figshare.com/articles/dataset/Arabic_Audio_Text_Dataset_Repository/30121789
Description:
This dataset, from the IDEAL 2025 project, was created to evaluate how well commercial speech-to-text (STT) tools transcribe Arabic speech.

Dataset Components
The repository includes several key components:
- Audio Files: Original recordings of native Arabic speakers, categorized by length (Short, Medium, Long, Very Long).
- Human Transcripts: Manually created transcripts that serve as the "ground truth" for accuracy comparison.
- Tool Transcripts: Machine-generated transcripts from six commercial STT tools: Clipto, Maestra, Notta, Sonix, Turboscribe, and Veed.
- Metadata: An Excel file containing details like recording duration, topic, word counts, speaker age, gender, and transcription accuracy scores.
- Ethical Documentation: Includes IRB approval, ensuring data was collected ethically and personally identifiable information was removed.
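The description does not say how the transcription accuracy scores were computed. As an illustration only, the sketch below compares a tool transcript against a human transcript using word error rate (WER), a common STT metric; choosing WER is an assumption, not something the dataset states. The function name `word_error_rate` and the sample sentences are hypothetical, and no Arabic-specific normalization (diacritics, letter variants, punctuation) is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the tool output drops one of four reference words.
human = "مرحبا بكم في البرنامج"
tool = "مرحبا بكم البرنامج"
print(word_error_rate(human, tool))
```

A lower value means the tool transcript is closer to the human transcript. Values above 1.0 are possible when the hypothesis contains many inserted words.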
Provider:
figshare
Created:
2025-09-14