
openbank-uz/youtube_transcriptions

Source: Hugging Face. Updated 2026-03-26. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/openbank-uz/youtube_transcriptions
Description:
---
language:
- uz
license: cc-by-nc-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- uzbek
- speech
- audio
- tts
- asr
- youtube
- gemini
- speaker-clustering
pretty_name: Uzbek YouTube Speech Dataset
size_categories:
- 100K<n<1M
---

## Dataset Description

A speech dataset of Uzbek-language audio clips sourced from YouTube videos. Audio segments were extracted, separated by speaker using vocal isolation, and transcribed with Google's **Gemini 2.0 Flash** model. Speaker identities were clustered using ECAPA-TDNN embeddings.

### Use Cases

- **Automatic Speech Recognition (ASR)** for Uzbek
- **Text-to-Speech (TTS)** synthesis for Uzbek
- Fine-tuning speech models on Uzbek-language data (e.g., Qwen3-TTS)
- Speaker-conditioned TTS training

## Code

- Scraping and transcription: [GitHub repo](https://github.com/Guide-Me-Tech/pdsrt)

## Dataset Structure

| Column | Type | Description |
|---|---|---|
| `audio` | `audio` | Audio column (playable on Hugging Face), 16 kHz |
| `file` | `string` | Original audio filename |
| `gender` | `string` | Speaker gender (`male` / `female`) |
| `transcription` | `string` | Uzbek text transcription (generated by Gemini 2.0 Flash) |
| `speaker_id` | `string` | Clustered speaker identity (e.g., `spk_131`), 230 unique speakers |
| `parsed_time` | `string` | Timestamp when the source video was processed |
| `video_title` | `string` | Title of the source YouTube video |
| `youtube_video_duration` | `string` | Duration of the source video |
| `audio_format` | `string` | Audio encoding format (e.g., `opus32`) |
| `speaker_id_in_video` | `string` | Original diarization speaker label within the video (e.g., `01`) |
| `speaker_audio_index` | `int` | Index of this audio segment for the speaker within the video |
| `prompt_tokens_used` | `int64` | Number of prompt tokens consumed during transcription |
| `total_tokens_used` | `int64` | Total tokens consumed during transcription |

### Example

```python
{
    "file": "2025-01-10_20-08-32_umrimiz_asli_qisqa_emas_..._vocalsSPEAKER_01_0.mp3",
    "gender": "male",
    "transcription": "Haqiqat qoladi, yo'qolmaydi.",
    "speaker_id": "spk_131",
    "parsed_time": "2025-01-10_20-08-32",
    "video_title": "umrimiz asli qisqa emas kopini bekorga sarflab yuboramiz iqtibos podcast 35",
    "youtube_video_duration": "31m43s",
    "audio_format": "opus32",
    "speaker_id_in_video": "01",
    "speaker_audio_index": 0,
    "prompt_tokens_used": 41,
    "total_tokens_used": 72,
    "audio": {"path": "...", "array": [...], "sampling_rate": 16000}
}
```

## Data Collection Pipeline

1. **Source**: Public Uzbek-language YouTube videos (podcasts, talks, interviews)
2. **Audio extraction**: Audio tracks extracted from videos and converted to MP3 (Opus 32 kbps)
3. **Speaker separation**: Vocals isolated and split by speaker diarization (speaker segments labeled as `SPEAKER_XX`)
4. **Transcription**: Each audio segment transcribed using **Gemini 2.0 Flash**
5. **Gender labeling**: Speaker gender annotated per segment
6. **Speaker clustering**: Global speaker identities assigned across videos (see below)

## Speaker Clustering

Speaker identities (`speaker_id`) were assigned across the full dataset using the following method:

1. A speaker embedding was computed for each audio segment.
2. The embeddings were then clustered into 230 unique speakers using a combination of two clustering methods.

### Embedding Extraction

Speaker embeddings were extracted using [SpeechBrain's ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) model pretrained on VoxCeleb.

For the code, see this notebook: [📒 Google Colab Notebook](https://colab.research.google.com/drive/1YqsvFru-sB_5R85DU1_-qzGxnBaPUk1A?usp=sharing)

### Clustering Method

Due to the dataset size (~360K rows), a two-stage clustering approach was used:

1. **Optimal cluster count selection**: Agglomerative Clustering (cosine distance, average linkage) was run on a 10,000-sample subset, testing cluster counts in `range(50, 600, 20)`. The cluster count with the best silhouette score was selected.
2. **Full dataset clustering**: MiniBatchKMeans (batch size 4096) was applied to all ~360K samples using the cluster count from step 1.

**Result**: 230 unique speakers identified.

```python
from sklearn.cluster import AgglomerativeClustering, MiniBatchKMeans
from sklearn.metrics import silhouette_score

# X: speaker embeddings, shape (n_samples, embedding_dim)

# Step 1: Find optimal n on subset
X_sub = X[:10000]
best_n, best_score = None, -1.0
for n in range(50, 600, 20):
    clustering = AgglomerativeClustering(
        n_clusters=n, metric="cosine", linkage="average"
    )
    labels = clustering.fit_predict(X_sub)
    score = silhouette_score(X_sub, labels, metric="cosine")
    # Keep the cluster count with the highest silhouette score
    if score > best_score:
        best_n, best_score = n, score

# Step 2: Cluster full dataset with best n
clustering = MiniBatchKMeans(
    n_clusters=best_n, batch_size=4096, random_state=42
)
labels = clustering.fit_predict(X)
```

### Caveats

- Clustering is approximate: some speakers may be split across multiple IDs or merged into one, especially speakers with similar vocal characteristics.
- The original in-video diarization labels (`speaker_id_in_video`) are preserved for cross-referencing.
- ECAPA-TDNN was not fine-tuned on Uzbek speech, which may affect clustering quality.

## Limitations

- Transcriptions are machine-generated (Gemini 2.0 Flash) and may contain errors, especially for domain-specific vocabulary, names, or dialectal speech.
- Audio quality varies depending on the original YouTube source.
- Gender labels may not be verified for every segment.
- Speaker clustering is unsupervised and approximate; it has not been human-verified.
- The dataset is sourced from publicly available YouTube content. If you are a content creator and wish to have your content removed, please open a discussion.

## Citation

If you use this dataset, please cite it as:

```bibtex
@dataset{openbank-uz/youtube_transcriptions,
  title={Uzbek YouTube Speech Dataset},
  year={2025},
  source={YouTube},
  transcription_model={Gemini 2.0 Flash},
  speaker_clustering={ECAPA-TDNN + MiniBatchKMeans}
}
```
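
## Cross-Referencing Sketch

The caveats note that `speaker_id_in_video` is kept so that the global `speaker_id` clusters can be cross-checked. A minimal sketch of one such check follows: it lists in-video speakers whose segments ended up in more than one global cluster. The function name `find_split_speakers` and the sample rows are hypothetical, and the code assumes that `parsed_time` identifies a single processed video; neither is confirmed by the dataset card.

```python
from collections import defaultdict


def find_split_speakers(rows):
    """Return in-video speakers that map to more than one global speaker_id.

    Keys are (parsed_time, speaker_id_in_video) pairs; values are the sorted
    global speaker IDs assigned to that diarized speaker.
    """
    clusters_by_local = defaultdict(set)
    for row in rows:
        # Assumption: parsed_time is unique per processed video, so this pair
        # identifies one diarized speaker within one video.
        key = (row["parsed_time"], row["speaker_id_in_video"])
        clusters_by_local[key].add(row["speaker_id"])
    return {
        key: sorted(ids)
        for key, ids in clusters_by_local.items()
        if len(ids) > 1
    }


# Hypothetical rows invented for illustration; they are not real dataset rows.
rows = [
    {"parsed_time": "2025-01-10_20-08-32", "speaker_id_in_video": "01", "speaker_id": "spk_131"},
    {"parsed_time": "2025-01-10_20-08-32", "speaker_id_in_video": "01", "speaker_id": "spk_007"},
    {"parsed_time": "2025-01-10_20-08-32", "speaker_id_in_video": "02", "speaker_id": "spk_042"},
]

split = find_split_speakers(rows)
print(split)
```

A flagged pair is only a hint. The diarization labels are themselves automatic, so a split may reflect a diarization error rather than a clustering error.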
Provider:
openbank-uz