Jurabek/uzbekvoice-filtered2

Name: Jurabek/uzbekvoice-filtered2
Creator: Jurabek
Published: 2026-04-13 11:53:22
License: apache-2.0 (as declared in the dataset card metadata)

Source: Hugging Face. Updated: 2026-04-13. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Jurabek/uzbekvoice-filtered2

Description:

---
language:
- uz
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- automatic-speech-recognition
dataset_info:
  features:
  - name: path
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: previous_text
    dtype: string
  - name: id
    dtype: int64
  - name: client_id
    dtype: string
  - name: duration
    dtype: float64
  - name: sentence
    dtype: string
  - name: created_at
    dtype: string
  - name: original_sentence_id
    dtype: string
  - name: sentence_clips_count
    dtype: int64
  - name: upvotes_count
    dtype: int64
  - name: downvotes_count
    dtype: int64
  - name: reported_count
    dtype: int64
  - name: reported_reasons
    dtype: string
  - name: skipped_clips
    dtype: int64
  - name: gender
    dtype: string
  - name: accent_region
    dtype: string
  - name: native_language
    dtype: string
  - name: year_of_birth
    dtype: string
  splits:
  - name: train
    num_bytes: 13791343519.24
    num_examples: 501330
  - name: validate
    num_bytes: 57649995.584
    num_examples: 2048
  download_size: 13680801049
  dataset_size: 13848993514.824
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validate
    path: data/validate-*
---

This is a heavily filtered version of the dataset with additional information. This dataset does not contain the original Mozilla Common Voice audio or text.

We filtered the dataset using a number of approaches:

1. VAD + noise detection. Audio clips that lacked voice activity and produced no sound after denoising were removed.
2. Reading speed. Audio clips with outlier reading speeds (approximately 5-10%) were removed, as they did not match a natural speaking speed or were too noisy.
3. Automatic STT validation. We trained a model on a subset of valid samples from different authors, then used it to extend the set of valid samples, accepting those whose transcription matched the model's output to some extent. We repeated this step multiple times until we reached the current dataset size.
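The reading-speed filter and the iterative STT validation described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the percentile trimming rule, the `difflib` similarity measure, and every name here (`chars_per_second`, `filter_by_reading_speed`, `trim_percent`, `transcripts_agree`, `min_ratio`, `expand_valid_set`, and the `train` and `transcribe` callables) are assumptions chosen for illustration. Only the `text` and `duration` field names come from the dataset schema.

```python
from difflib import SequenceMatcher
from statistics import quantiles


def chars_per_second(text, duration):
    """Reading speed as non-space characters per second of audio."""
    return len(text.replace(" ", "")) / duration


def filter_by_reading_speed(rows, trim_percent=5):
    """Keep rows whose reading speed lies inside the central percentile band.

    Assumption: trim_percent percent is cut from each tail; the dataset card
    only says that roughly 5-10% of clips were outliers.
    """
    speeds = [chars_per_second(r["text"], r["duration"]) for r in rows]
    cuts = quantiles(speeds, n=100)  # 99 cut points; cuts[k - 1] is the k-th percentile
    low, high = cuts[trim_percent - 1], cuts[100 - trim_percent - 1]
    return [r for r, s in zip(rows, speeds) if low <= s <= high]


def normalize(text):
    """Lowercase and collapse whitespace before comparing transcripts."""
    return " ".join(text.lower().split())


def transcripts_agree(reference, hypothesis, min_ratio=0.9):
    """True when the STT output is close enough to the reference transcript.

    Assumption: difflib's similarity ratio stands in for the unspecified
    "match to some extent" criterion.
    """
    matcher = SequenceMatcher(None, normalize(reference), normalize(hypothesis))
    return matcher.ratio() >= min_ratio


def expand_valid_set(valid, candidates, train, transcribe, rounds=3):
    """Retrain, accept candidates the model agrees with, and repeat.

    train(samples) and transcribe(model, sample) are hypothetical callables
    supplied by the caller; no real STT library is assumed.
    """
    valid, pending = list(valid), list(candidates)
    for _ in range(rounds):
        model = train(valid)
        still_pending, added = [], 0
        for sample in pending:
            if transcripts_agree(sample["text"], transcribe(model, sample)):
                valid.append(sample)
                added += 1
            else:
                still_pending.append(sample)
        pending = still_pending
        if added == 0:  # nothing new was accepted, so further rounds cannot help
            break
    return valid
```

In this sketch, each round can only grow the valid set, and the loop stops early once a retrained model accepts no further candidates. A real pipeline would likely use a word or character error rate rather than a generic string similarity.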

Provided by:

Jurabek
