openbank-uz/youtube_transcriptions
Source: Hugging Face. Updated 2026-03-26, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/openbank-uz/youtube_transcriptions
Description:
---
language:
- uz
license: cc-by-nc-4.0
task_categories:
- automatic-speech-recognition
- text-to-speech
tags:
- uzbek
- speech
- audio
- tts
- asr
- youtube
- gemini
- speaker-clustering
pretty_name: Uzbek YouTube Speech Dataset
size_categories:
- 100K<n<1M
---
## Dataset Description
A speech dataset of Uzbek-language audio clips sourced from YouTube videos. Vocals were isolated from each video's audio, split into segments by speaker diarization, and transcribed using Google's **Gemini 2.0 Flash** model. Global speaker identities were then assigned by clustering ECAPA-TDNN embeddings.
### Use Cases
- **Automatic Speech Recognition (ASR)** for Uzbek
- **Text-to-Speech (TTS)** synthesis for Uzbek
- Fine-tuning speech models on Uzbek language data (e.g., Qwen3-TTS)
- Speaker-conditioned TTS training
## Code
- Scraping and transcription: [GitHub repo](https://github.com/Guide-Me-Tech/pdsrt)
## Dataset Structure
| Column | Type | Description |
|---|---|---|
| `audio` | `audio` | Audio column (playable on Hugging Face), 16 kHz |
| `file` | `string` | Original audio filename |
| `gender` | `string` | Speaker gender (`male` / `female`) |
| `transcription` | `string` | Uzbek text transcription (generated by Gemini 2.0 Flash) |
| `speaker_id` | `string` | Clustered speaker identity (e.g., `spk_131`), 230 unique speakers |
| `parsed_time` | `string` | Timestamp when the source video was processed |
| `video_title` | `string` | Title of the source YouTube video |
| `youtube_video_duration` | `string` | Duration of the source video |
| `audio_format` | `string` | Audio encoding format (e.g., `opus32`) |
| `speaker_id_in_video` | `string` | Original diarization speaker label within the video (e.g., `01`) |
| `speaker_audio_index` | `int` | Index of this audio segment for the speaker within the video |
| `prompt_tokens_used` | `int64` | Number of prompt tokens consumed during transcription |
| `total_tokens_used` | `int64` | Total tokens consumed during transcription |
### Example
```python
{
"file": "2025-01-10_20-08-32_umrimiz_asli_qisqa_emas_..._vocalsSPEAKER_01_0.mp3",
"gender": "male",
"transcription": "Haqiqat qoladi, yo'qolmaydi.",
"speaker_id": "spk_131",
"parsed_time": "2025-01-10_20-08-32",
"video_title": "umrimiz asli qisqa emas kopini bekorga sarflab yuboramiz iqtibos podcast 35",
"youtube_video_duration": "31m43s",
"audio_format": "opus32",
"speaker_id_in_video": "01",
"speaker_audio_index": 0,
"prompt_tokens_used": 41,
"total_tokens_used": 72,
"audio": {"path": "...", "array": [...], "sampling_rate": 16000}
}
```
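Some fields are stored as strings or raw counts and may need light post-processing. The sketch below is an illustration, not part of the dataset's tooling. `duration_to_seconds` and `completion_tokens` are hypothetical helpers. The duration format is inferred from the single example value `31m43s` (the optional hours part is a guess), and treating `total_tokens_used - prompt_tokens_used` as the number of generated tokens is an assumption about how the counts were recorded.

```python
import re

# Hypothetical helper: assumes durations look like "31m43s", optionally with
# an hours part ("1h02m03s"). Only "31m43s" is confirmed by the example row.
_DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def duration_to_seconds(text: str) -> int:
    """Convert a duration string such as '31m43s' to whole seconds."""
    match = _DURATION_RE.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def completion_tokens(row: dict) -> int:
    """Tokens not accounted for by the prompt (assumed to be model output)."""
    return row["total_tokens_used"] - row["prompt_tokens_used"]


# Values copied from the example row above.
row = {
    "youtube_video_duration": "31m43s",
    "prompt_tokens_used": 41,
    "total_tokens_used": 72,
}
print(duration_to_seconds(row["youtube_video_duration"]))  # 1903
print(completion_tokens(row))  # 31
```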
## Data Collection Pipeline
1. **Source**: Public Uzbek-language YouTube videos (podcasts, talks, interviews)
2. **Audio extraction**: Audio tracks extracted from videos and converted to MP3 (Opus 32 kbps)
3. **Speaker separation**: Vocals isolated and split by speaker diarization (speaker segments labeled as `SPEAKER_XX`)
4. **Transcription**: Each audio segment transcribed using **Gemini 2.0 Flash**
5. **Gender labeling**: Speaker gender annotated per segment
6. **Speaker clustering**: Global speaker identities assigned across videos (see below)
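The pipeline steps leave traces in the `file` column: the example filename appears to combine the processing timestamp, a slug of the video title, the diarization label, and the segment index. The sketch below recovers those parts with a regular expression. The layout is inferred from the one example filename in this card, so the pattern and the `parse_segment_filename` helper are assumptions, not a documented naming scheme.

```python
import re
from typing import NamedTuple

# Assumed layout, inferred from a single example filename:
#   <parsed_time>_<title slug>_vocalsSPEAKER_<label>_<index>.mp3
_FILE_RE = re.compile(
    r"^(?P<parsed_time>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})_"
    r"(?P<slug>.+)_vocalsSPEAKER_(?P<label>\d+)_(?P<index>\d+)\.mp3$"
)


class SegmentName(NamedTuple):
    parsed_time: str
    slug: str
    speaker_id_in_video: str
    speaker_audio_index: int


def parse_segment_filename(name: str) -> SegmentName:
    """Split a segment filename into its assumed components."""
    match = _FILE_RE.match(name)
    if match is None:
        raise ValueError(f"filename does not match the assumed layout: {name!r}")
    return SegmentName(
        match["parsed_time"], match["slug"], match["label"], int(match["index"])
    )


# Filename copied from the example row, including its elided "..." part.
example = "2025-01-10_20-08-32_umrimiz_asli_qisqa_emas_..._vocalsSPEAKER_01_0.mp3"
parsed = parse_segment_filename(example)
print(parsed.parsed_time, parsed.speaker_id_in_video, parsed.speaker_audio_index)
```

For the example row, the parsed values agree with the `parsed_time`, `speaker_id_in_video`, and `speaker_audio_index` columns, which is a quick consistency check when filtering rows.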
## Speaker Clustering
Speaker identities (`speaker_id`) were assigned across the full dataset using the following method:
1. A speaker embedding was computed for each audio segment.
2. The embeddings were then clustered into 230 unique speakers using a combination of two clustering methods.
### Embedding Extraction
Speaker embeddings were extracted using [SpeechBrain's ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) model pretrained on VoxCeleb.
For the code, see this notebook: [📒 Google Colab Notebook](https://colab.research.google.com/drive/1YqsvFru-sB_5R85DU1_-qzGxnBaPUk1A?usp=sharing)
### Clustering Method
Due to the dataset size (~360K rows), a two-stage clustering approach was used:
1. **Optimal cluster count selection**: Agglomerative Clustering (cosine distance, average linkage) was run on a 10,000-sample subset, testing cluster counts in `range(50, 600, 20)`. The cluster count with the highest silhouette score was selected.
2. **Full dataset clustering**: MiniBatchKMeans (batch size 4096) was applied to all ~360K samples using the optimal cluster count from step 1.
**Result**: 230 unique speakers identified.
```python
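from sklearn.cluster import AgglomerativeClustering, MiniBatchKMeans
from sklearn.metrics import silhouette_score

# X: array of shape (n_samples, embedding_dim) holding the speaker embeddings

# Step 1: Find optimal n on subset
X_sub = X[:10000]
best_n, best_score = None, -1.0
for n in range(50, 600, 20):
    clustering = AgglomerativeClustering(
        n_clusters=n, metric="cosine", linkage="average"
    )
    labels = clustering.fit_predict(X_sub)
    score = silhouette_score(X_sub, labels, metric="cosine")
    if score > best_score:  # keep the n with the highest silhouette score
        best_n, best_score = n, score

# Step 2: Cluster full dataset with best n
clustering = MiniBatchKMeans(
    n_clusters=best_n, batch_size=4096, random_state=42
)
labels = clustering.fit_predict(X)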
```
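The two stages use different distances: the agglomerative step is configured with cosine distance, while MiniBatchKMeans minimizes squared Euclidean distance. This card does not say whether the embeddings were normalized before the second stage. One common way to reconcile the two, shown here only as a sketch and not as the authors' method, is to L2-normalize each embedding first, because for unit vectors the squared Euclidean distance equals twice the cosine distance. The helper names and the toy vectors below are illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 3-dimensional vectors standing in for speaker embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 0.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
lhs = squared_euclidean(a_unit, b_unit)  # distance KMeans sees after normalizing
rhs = 2.0 * cosine_distance(a, b)        # twice the cosine distance
print(math.isclose(lhs, rhs))  # True
```

With NumPy, the same normalization of a full embedding matrix is `X / np.linalg.norm(X, axis=1, keepdims=True)`.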
### Caveats
- Clustering is approximate — some speakers may be split across multiple IDs or merged into one, especially for speakers with similar vocal characteristics.
- The original in-video diarization labels (`speaker_id_in_video`) are preserved for cross-referencing.
- ECAPA-TDNN was not fine-tuned on Uzbek speech, which may affect clustering quality.
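One way to use the preserved diarization labels is to flag likely splits: if a single in-video speaker maps to more than one global `speaker_id`, either the clustering split that speaker or the diarization merged two people. The sketch below is a hypothetical check, not part of the dataset's tooling. It assumes that `parsed_time` together with `video_title` identifies a source video, and the rows are made up for illustration.

```python
from collections import defaultdict


def find_split_speakers(rows):
    """Return in-video speakers that were assigned more than one global ID."""
    seen = defaultdict(set)
    for row in rows:
        # Assumption: (parsed_time, video_title) uniquely identifies a video.
        key = (row["parsed_time"], row["video_title"], row["speaker_id_in_video"])
        seen[key].add(row["speaker_id"])
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}


# Made-up rows; they are not taken from the dataset.
rows = [
    {"parsed_time": "t1", "video_title": "video A",
     "speaker_id_in_video": "01", "speaker_id": "spk_1"},
    {"parsed_time": "t1", "video_title": "video A",
     "speaker_id_in_video": "01", "speaker_id": "spk_2"},
    {"parsed_time": "t1", "video_title": "video A",
     "speaker_id_in_video": "02", "speaker_id": "spk_3"},
    {"parsed_time": "t2", "video_title": "video B",
     "speaker_id_in_video": "01", "speaker_id": "spk_1"},
]

suspects = find_split_speakers(rows)
print(suspects)
```

A flagged entry is only a candidate for review, since the diarization labels themselves can be wrong.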
## Limitations
- Transcriptions are machine-generated (Gemini 2.0 Flash) and may contain errors, especially for domain-specific vocabulary, names, or dialectal speech.
- Audio quality varies depending on the original YouTube source.
- Gender labels may not be verified for every segment.
- Speaker clustering is unsupervised and approximate — not human-verified.
- The dataset is sourced from publicly available YouTube content. If you are a content creator and wish to have your content removed, please open a discussion.
## Citation
If you use this dataset, please cite it as:
```bibtex
@dataset{openbank-uz/youtube_transcriptions,
title={Uzbek YouTube Speech Dataset},
year={2025},
source={YouTube},
transcription_model={Gemini 2.0 Flash},
speaker_clustering={ECAPA-TDNN + MiniBatchKMeans}
}
```
Provider:
openbank-uz



