BoburAmirov/podcasts_tashkent_dialect_youtube_uzbek_speech_dataset

Name: BoburAmirov/podcasts_tashkent_dialect_youtube_uzbek_speech_dataset
Creator: BoburAmirov
Published: 2025-12-11 11:45:47
License: No description available

Source: Hugging Face, updated 2025-12-11, indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/BoburAmirov/podcasts_tashkent_dialect_youtube_uzbek_speech_dataset

Description:

This dataset contains audio clips and their corresponding transcriptions in Uzbek, mostly in the Tashkent dialect. The data was collected from publicly available podcast videos on YouTube and is intended for training and evaluating automatic speech recognition (ASR) models. Most of the content comes from YouTube videos of the Jahongir Latipov interviews and the Bu podcast (with credit to the original authors). The audio was transcribed using Gemini 2.5 Pro and the results were then filtered. The audio clips are segmented and formatted for use with modern deep learning frameworks.
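Since the dataset is intended for evaluating ASR models, a common way to score a model's output against the provided transcriptions is word error rate (WER). The following is a minimal sketch, not part of the dataset itself: the function name `word_error_rate` and the whitespace-based tokenization are assumptions made for illustration, and the sample sentences are invented rather than taken from the data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace, with no further
    normalization (casing and punctuation are compared as-is).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented example: one of three reference words is missing.
    print(word_error_rate("men bugun keldim", "men keldim"))
```

In practice, transcripts are usually normalized (for example lowercased, with punctuation removed) before scoring. Whether that is appropriate here depends on how the transcriptions in this dataset are formatted, which the description does not state.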

Provider:

BoburAmirov
