
lab260/deepspeech_balalaika

Source: Hugging Face. Updated 2026-01-29; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/lab260/deepspeech_balalaika
---
license: mpl-2.0
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- ru
pretty_name: Deepspeech annotate by Balalaika
---

# Deepspeech Annotated by Balalaika

**A curated Russian speech dataset for advanced speech generative tasks.**

## Overview

**Deepspeech Annotated by Balalaika** is a high-quality Russian speech corpus, filtered and annotated by the **lab260 team at MTUCI** with the latest version of our pipeline, **BALALAIKA**.

- **Language:** Russian only
- **Genres:** Podcasts, public speech, YouTube, audiobooks, phone calls, TTS, and more
- **Source:** Deepspeech ([GitHub link](https://github.com/GeorgeFedoseev/DeepSpeech))
- **License:** mpl-2.0 (same as original DeepSpeech)
- **Total Duration After Filtering:** 278.6 hours (from over 6000 hours raw)
- **Format:** Parquet files with split-wise annotation

***

## Usage

**Primary Use Cases:**

- Text-to-Speech (TTS) generation
- Automatic Speech Recognition (ASR)
- Analysis of accent, stress, and prosody
- Russian speech technology research

### 1. Download the dataset

### 2. Extract the files

```bash
# Unpack every archive into a folder named after it.
for archive in *.tar.gz; do
    dir="${archive%.tar.gz}"
    mkdir -p "$dir"
    # Delete the archive only if extraction succeeded.
    tar -xzvf "$archive" -C "$dir" && rm "$archive"
done
```

### 3. Load data in PyTorch

```python
from pathlib import Path

import pandas as pd
import torchaudio
from torch.utils.data import Dataset


class ParquetConcatDataset(Dataset):
    """Concatenates all Parquet metadata files and loads audio on demand."""

    def __init__(self, parquet_dir, audio_root, parse_fn=None):
        # parse_fn is accepted but not used by this example.
        self.parquet_dir = Path(parquet_dir)
        self.audio_root = Path(audio_root)

        # Sort for a deterministic row order across runs.
        parquet_files = sorted(self.parquet_dir.glob("*.parquet"))
        if not parquet_files:
            raise FileNotFoundError(f"No .parquet files in {self.parquet_dir}")
        dfs = [pd.read_parquet(f) for f in parquet_files]
        self.df = pd.concat(dfs, ignore_index=True)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # "filepath" is relative to the audio root directory.
        audio_path = self.audio_root / row["filepath"]
        waveform, sample_rate = torchaudio.load(audio_path)

        return {
            "audio_path": str(audio_path),
            "waveform": waveform,
            "sample_rate": sample_rate,
            "nisqa_mos": row["mos_pred"],
            "nisqa_noi": row["noi_pred"],
            "nisqa_dis": row["dis_pred"],
            "nisqa_col": row["col_pred"],
            "nisqa_loud": row["loud_pred"],
            "nisqa_model": row["model"],
            "is_single_speaker": bool(row["is_single_speaker"]),
            "accented_text": row["accent"],
            "asr_text": row["rover"],
            "punctuated_text": row["punct"],
            "silence_percent": row["silence_percent"],
            "total_duration": row["total_duration"],
            "max_silence_duration": row["max_silence_duration"],
        }


# Example usage: replace the two placeholders with your local paths.
ds = ParquetConcatDataset(
    PATH_TO_PARQUETS_DIR,
    PATH_TO_AUDIO_ROOT,
)
```

- `PATH_TO_PARQUETS_DIR`: Path to the folder containing all .parquet files with metadata and annotations for the dataset.
- `PATH_TO_AUDIO_ROOT`: Path to the root directory containing all audio subfolders and files referenced by the `filepath` column in the metadata.

***

## Data Processing & Annotation

Our pipeline applies **rigorous filtering and enrichment** steps:

1. **Removed speech segments** shorter than 3 seconds
2. **Removed segments** with [NISQA](https://github.com/gabrielmittag/NISQA/tree/master/nisqa) MOS < 4.0 for quality assurance
3. **Excluded segments with multiple speakers** (via [pyannote diarization](https://huggingface.co/pyannote/speaker-diarization-community-1))
4. **Removed segments** with [VAD](https://github.com/snakers4/silero-vad) silence_percent > 30.0 % and max_silence_duration > 1.2 for quality assurance
5. **Filtered out speech with music background** (custom music detector)
6. **Revised transcriptions:** produced by multiple ASR models and fused via ROVER ([T-one](https://github.com/voicekit-team/T-one/tree/main), [GigaAMv3-rnnt, GigaAMv3-ctc, GigaAMv3-ctc-lm](https://github.com/salute-developers/GigaAM), [vosk](https://huggingface.co/alphacep/vosk-model-ru))
7. **Punctuation added** using [RuPunct](https://huggingface.co/RUPunct/RUPunct_big)
8. **Stress marks added** via [RuAccent](https://github.com/Den4ikAI/ruaccent)
9. **IPA phonemization** performed with our own neural model

All **annotation fields** are provided separately for transparency and flexibility.

***

## Data Structure

- **Annotation storage:** Parquet files
- **Speech storage:** .tar.gz files with speech segments in .wav
- **Splitting:** Follows DeepSpeech splits
- **Annotations:** Each sample includes separate fields for:
  - **Filepath**
  - **Quality metrics: MOS, NOI, DIS, COL, LOUD**
  - **Model for quality assessment**
  - **Transcript with stresses and punctuation**
  - **Transcript after ROVER**
  - **Transcript with punctuation**
  - **IPA transcription**
  - **Speaker diarization flag**
  - **Information about silence**

***

## How to Cite

Please cite the following paper if you use this dataset in research:

```
@misc{borodin2025datacentricframeworkaddressingphonetic,
  title={A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models},
  author={Kirill Borodin and Nikita Vasiliev and Vasiliy Kudryavtsev and Maxim Maslov and Mikhail Gorodnichev and Oleg Rogov and Grach Mkrtchian},
  year={2025},
  eprint={2507.13563},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.13563},
}
```

***

## Contact

- Telegram: [@korallll_ai](https://t.me/korallll_ai)
- Email: [k.n.borodin@mtuci.ru](mailto:k.n.borodin@mtuci.ru)

***

## Links

- [BALALAIKA annotation pipeline](https://github.com/mtuciru/balalaika/tree/main/src)
- [Other datasets annotated by BALALAIKA](https://huggingface.co/collections/MTUCI/balalaika-dataset)
- [Custom models' inference implementation](https://huggingface.co/collections/MTUCI/balalaika-models)
- [Paper (arXiv)](https://arxiv.org/pdf/2507.13563)
- [DeepSpeech repository](https://github.com/GeorgeFedoseev/DeepSpeech)
- [NISQA](https://github.com/gabrielmittag/NISQA/tree/master/nisqa)
- [pyannote diarization](https://huggingface.co/pyannote/speaker-diarization-community-1)
- [T-one](https://github.com/voicekit-team/T-one/tree/main)
- [GigaAMv2-rnnt, GigaAMv2-ctc, GigaAMv2-ctc-lm](https://github.com/salute-developers/GigaAM)
- [vosk](https://huggingface.co/alphacep/vosk-model-ru)
- [RuPunct](https://huggingface.co/RUPunct/RUPunct_big)
- [RuAccent](https://github.com/Den4ikAI/ruaccent)

***

## License

Distributed under **MPL 2.0**, matching original DeepSpeech terms.

***
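## Appendix: re-checking the filtering thresholds

The released data is already filtered, but the thresholds listed under Data Processing & Annotation can be re-applied (or tightened) on the metadata rows. The following is a minimal sketch, not the BALALAIKA implementation. It uses the column names from the PyTorch loader above and makes several assumptions: `total_duration` and `max_silence_duration` are in seconds, `silence_percent` is on a 0 to 100 scale, and a segment is dropped if it violates any single threshold. The function names `keep_segment`, `filter_rows`, and `total_hours` are hypothetical helpers introduced for illustration.

```python
from typing import Any, Iterable, List, Mapping

Row = Mapping[str, Any]


def keep_segment(
    row: Row,
    *,
    min_duration: float = 3.0,          # assumed seconds
    min_mos: float = 4.0,               # NISQA MOS threshold from the card
    max_silence_percent: float = 30.0,  # assumed 0-100 scale
    max_silence_duration: float = 1.2,  # assumed seconds
) -> bool:
    """Return True if a metadata row passes every threshold.

    Assumption: violating any one criterion drops the segment.
    """
    return (
        bool(row["is_single_speaker"])
        and row["total_duration"] >= min_duration
        and row["mos_pred"] >= min_mos
        and row["silence_percent"] <= max_silence_percent
        and row["max_silence_duration"] <= max_silence_duration
    )


def filter_rows(rows: Iterable[Row], **thresholds: float) -> List[Row]:
    """Keep only the rows that pass keep_segment with the given thresholds."""
    return [row for row in rows if keep_segment(row, **thresholds)]


def total_hours(rows: Iterable[Row]) -> float:
    """Sum total_duration (assumed seconds) and convert to hours."""
    return sum(row["total_duration"] for row in rows) / 3600.0


# Synthetic rows for illustration only; these are not taken from the dataset.
demo_rows = [
    {"total_duration": 5.0, "mos_pred": 4.3, "silence_percent": 10.0,
     "max_silence_duration": 0.4, "is_single_speaker": True},
    {"total_duration": 7.0, "mos_pred": 4.1, "silence_percent": 25.0,
     "max_silence_duration": 1.0, "is_single_speaker": True},
    {"total_duration": 6.0, "mos_pred": 3.5, "silence_percent": 10.0,
     "max_silence_duration": 0.4, "is_single_speaker": True},   # low MOS
    {"total_duration": 2.0, "mos_pred": 4.5, "silence_percent": 5.0,
     "max_silence_duration": 0.2, "is_single_speaker": True},   # too short
    {"total_duration": 8.0, "mos_pred": 4.4, "silence_percent": 45.0,
     "max_silence_duration": 0.9, "is_single_speaker": True},   # too much silence
    {"total_duration": 9.0, "mos_pred": 4.6, "silence_percent": 5.0,
     "max_silence_duration": 0.3, "is_single_speaker": False},  # multiple speakers
]

kept = filter_rows(demo_rows)
stricter = filter_rows(demo_rows, min_mos=4.2)
```

With the loader above, the same helpers can be applied to `ds.df.to_dict("records")`, and `total_hours` gives a quick way to compare the summed duration against the figure reported in the Overview.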
Provided by:
lab260