
Jurabek/uzbekvoice-filtered2

Source: Hugging Face. Updated: 2026-04-13. Indexed: 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Jurabek/uzbekvoice-filtered2
Description:
---
language:
- uz
license: apache-2.0
size_categories:
- 100K<n<1M
task_categories:
- automatic-speech-recognition
dataset_info:
  features:
  - name: path
    dtype:
      audio:
        sampling_rate: 16000
  - name: text
    dtype: string
  - name: previous_text
    dtype: string
  - name: id
    dtype: int64
  - name: client_id
    dtype: string
  - name: duration
    dtype: float64
  - name: sentence
    dtype: string
  - name: created_at
    dtype: string
  - name: original_sentence_id
    dtype: string
  - name: sentence_clips_count
    dtype: int64
  - name: upvotes_count
    dtype: int64
  - name: downvotes_count
    dtype: int64
  - name: reported_count
    dtype: int64
  - name: reported_reasons
    dtype: string
  - name: skipped_clips
    dtype: int64
  - name: gender
    dtype: string
  - name: accent_region
    dtype: string
  - name: native_language
    dtype: string
  - name: year_of_birth
    dtype: string
  splits:
  - name: train
    num_bytes: 13791343519.24
    num_examples: 501330
  - name: validate
    num_bytes: 57649995.584
    num_examples: 2048
  download_size: 13680801049
  dataset_size: 13848993514.824
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validate
    path: data/validate-*
---

This is a heavily filtered version of the dataset with additional information. This dataset does not contain the original Mozilla Common Voice audio or text.

We filtered the dataset using a number of approaches:

1. VAD + noise detection. Audio clips that lacked voice activity and produced no sound after denoising were removed.
2. Reading speed. Audio clips with outlier reading speeds (approximately 5-10%) were removed, as they didn't match natural speaking speed or were too noisy.
3. Automatic STT validation. We trained a model on a subset of valid samples from different authors, then used it to extend the set of samples, keeping those whose transcription matched the model's output to some extent. We repeated this step multiple times until we reached the current dataset size.
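The card does not publish its filtering code, so the following is only a minimal sketch of what steps 2 and 3 might look like. The field names `text` and `duration` come from the schema above. Everything else is an assumption made for illustration: the characters-per-second speed measure, the 5th/95th percentile cutoffs, the `difflib` similarity score, and the 0.9 threshold.

```python
from difflib import SequenceMatcher
from statistics import quantiles


def chars_per_second(text: str, duration: float) -> float:
    """Reading speed as non-space characters per second of audio (assumed measure)."""
    return len(text.replace(" ", "")) / duration


def filter_by_reading_speed(rows, low_pct=5, high_pct=95):
    """Keep rows whose reading speed lies between two percentiles.

    The percentile cutoffs are hypothetical; the card only says that
    roughly 5-10% of clips were dropped as speed outliers.
    """
    timed = [r for r in rows if r["duration"] > 0]  # avoid division by zero
    speeds = [chars_per_second(r["text"], r["duration"]) for r in timed]
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = quantiles(speeds, n=100, method="inclusive")
    lo, hi = cuts[low_pct - 1], cuts[high_pct - 1]
    return [r for r, s in zip(timed, speeds) if lo <= s <= hi]


def transcripts_agree(reference: str, hypothesis: str, min_ratio: float = 0.9) -> bool:
    """Return True when the STT output is close enough to the reference text.

    Uses difflib's character-level similarity as a stand-in for whatever
    match criterion the authors applied; min_ratio is a hypothetical threshold.
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())

    ratio = SequenceMatcher(None, normalize(reference), normalize(hypothesis)).ratio()
    return ratio >= min_ratio
```

In an iterative setup like the one described in step 3, `transcripts_agree` would be applied to each candidate clip's `text` and the current model's output, and accepted clips would be added to the training pool before the next round.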
Provider:
Jurabek