lab260/deepspeech_balalaika
Source: Hugging Face (updated 2026-01-29, indexed 2026-05-10)
Download link: https://hf-mirror.com/datasets/lab260/deepspeech_balalaika

---
license: mpl-2.0
task_categories:
- text-to-speech
- automatic-speech-recognition
language:
- ru
pretty_name: Deepspeech Annotated by Balalaika
---
# Deepspeech Annotated by Balalaika
**A curated Russian speech dataset for advanced speech generative tasks.**
## Overview
**Deepspeech Annotated by Balalaika** is a high-quality Russian speech corpus, meticulously filtered and annotated by the **lab260 team at MTUCI** with the latest version of our pipeline, **BALALAIKA**.
- **Language:** Russian only
- **Genres:** Podcasts, public speech, YouTube, audiobooks, phone calls, TTS, and more
- **Source:** DeepSpeech ([GitHub repository](https://github.com/GeorgeFedoseev/DeepSpeech))
- **License:** mpl-2.0 (same as original DeepSpeech)
- **Total Duration After Filtering:** 278.6 hours (from over 6000 hours raw)
- **Format:** Parquet files with split-wise annotation
***
## Usage
**Primary Use Cases:**
- Text-to-Speech (TTS) generation
- Automatic Speech Recognition (ASR)
- Analysis of accent, stress, and prosody
- Russian speech technology research
### 1. Download the dataset
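One possible way to fetch the files is the `huggingface_hub` client. The sketch below assumes that package is installed and that the repository id is `lab260/deepspeech_balalaika`, as in the download link; the helper name `download_dataset` and the default target folder are illustrative, not part of the dataset.

```python
REPO_ID = "lab260/deepspeech_balalaika"  # repository id taken from this card


def download_dataset(local_dir="deepspeech_balalaika"):
    """Download all files of the dataset repository into local_dir.

    Returns the path of the local folder holding the downloaded files.
    """
    # Imported here so the helper can be defined even if huggingface_hub
    # is not installed yet (assumption: `pip install huggingface_hub`).
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # the repository is a dataset, not a model
        local_dir=local_dir,
    )
```

Calling `download_dataset("data/balalaika")` should place the Parquet files and `.tar.gz` archives under `data/balalaika`, ready for the extraction step.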
### 2. Extract the files
```bash
for archive in *.tar.gz; do
    dir="${archive%.tar.gz}"
    mkdir -p "$dir"
    # Delete the archive only if extraction succeeded
    tar -xzvf "$archive" -C "$dir" && rm "$archive"
done
```
### 3. Load data in PyTorch
```python
from pathlib import Path

import pandas as pd
import torchaudio
from torch.utils.data import Dataset


class ParquetConcatDataset(Dataset):
    """Concatenates all Parquet metadata files and loads audio on demand."""

    def __init__(self, parquet_dir, audio_root, parse_fn=None):
        self.parquet_dir = Path(parquet_dir)
        self.audio_root = Path(audio_root)
        self.parse_fn = parse_fn  # accepted but not used in this example
        # Sort the files so that row order is reproducible across runs.
        parquet_files = sorted(self.parquet_dir.glob("*.parquet"))
        if not parquet_files:
            raise FileNotFoundError(f"No .parquet files in {self.parquet_dir}")
        dfs = [pd.read_parquet(f) for f in parquet_files]
        self.df = pd.concat(dfs, ignore_index=True)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        # "filepath" is relative to the audio root directory.
        audio_path = self.audio_root / row["filepath"]
        waveform, sample_rate = torchaudio.load(str(audio_path))
        return {
            "audio_path": str(audio_path),
            "waveform": waveform,
            "sample_rate": sample_rate,
            "nisqa_mos": row["mos_pred"],
            "nisqa_noi": row["noi_pred"],
            "nisqa_dis": row["dis_pred"],
            "nisqa_col": row["col_pred"],
            "nisqa_loud": row["loud_pred"],
            "nisqa_model": row["model"],
            "is_single_speaker": bool(row["is_single_speaker"]),
            "accented_text": row["accent"],
            "asr_text": row["rover"],
            "punctuated_text": row["punct"],
            "silence_percent": row["silence_percent"],
            "total_duration": row["total_duration"],
            "max_silence_duration": row["max_silence_duration"],
        }


# Example usage: replace both placeholder paths with your local paths.
PATH_TO_PARQUETS_DIR = "path/to/parquets"
PATH_TO_AUDIO_ROOT = "path/to/audio"
ds = ParquetConcatDataset(PATH_TO_PARQUETS_DIR, PATH_TO_AUDIO_ROOT)
```
`PATH_TO_PARQUETS_DIR`: Path to the folder containing all .parquet files with metadata and annotations for the dataset.
`PATH_TO_AUDIO_ROOT`: Path to the root directory containing all audio subfolders and files referenced by filepath columns in the metadata.
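
Because segments differ in length, PyTorch's default batching cannot stack the waveforms returned by the dataset class above. A minimal sketch of a padding collate function follows; the name `collate_waveforms` and the choice to average channels to mono are assumptions made for illustration, and the sketch does not check that all files share one sample rate.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_waveforms(batch):
    """Zero-pad the waveforms of a list of sample dicts to a common length."""
    # Each waveform has shape (channels, time); average the channels so
    # every item becomes a 1-D tensor of shape (time,).
    mono = [sample["waveform"].mean(dim=0) for sample in batch]
    lengths = torch.tensor([w.shape[0] for w in mono])
    return {
        # Shape (batch, max_time); shorter items are padded with zeros.
        "waveforms": pad_sequence(mono, batch_first=True),
        # Original lengths, needed to mask out the padding later.
        "lengths": lengths,
        "texts": [sample["punctuated_text"] for sample in batch],
    }
```

It can be passed to a loader as `torch.utils.data.DataLoader(ds, batch_size=8, collate_fn=collate_waveforms)`.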
***
## Data Processing & Annotation
Our pipeline applies **rigorous filtering and enrichment** steps:
1. **Removed speech segments** shorter than 3 seconds
2. **Removed segments** with [NISQA](https://github.com/gabrielmittag/NISQA/tree/master/nisqa) MOS < 4.0 for quality assurance
3. **Excluded segments with multiple speakers** (via [pyannote diarization](https://huggingface.co/pyannote/speaker-diarization-community-1))
4. **Removed segments** with [VAD](https://github.com/snakers4/silero-vad) `silence_percent` > 30.0% and `max_silence_duration` > 1.2 for quality assurance
5. **Filtered out speech with music background** (custom music detector)
6. **Revised transcriptions:** Produced by multiple ASR systems and fused via ROVER ([T-one](https://github.com/voicekit-team/T-one/tree/main), [GigaAMv3-rnnt, GigaAMv3-ctc, GigaAMv3-ctc-lm](https://github.com/salute-developers/GigaAM), [vosk](https://huggingface.co/alphacep/vosk-model-ru))
7. **Punctuation added** using [RuPunct](https://huggingface.co/RUPunct/RUPunct_big)
8. **Stress marks added** via [RuAccent](https://github.com/Den4ikAI/ruaccent)
9. **IPA phonemization** performed with our own neural model
All **annotation fields** are handled and provided separately for transparency and flexibility.
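
The numeric thresholds listed above can be re-applied to the metadata, for example to sanity-check a split or to select a stricter subset. The sketch below is not the BALALAIKA implementation: the function name `passes_filters` is hypothetical, column names are taken from the loading example, durations are assumed to be in seconds, a segment is assumed to be rejected when either silence limit is exceeded, and the music detector is not represented.

```python
from typing import Any, Mapping

# Thresholds quoted from the pipeline description.
MIN_DURATION = 3.0           # assumed to be seconds
MIN_MOS = 4.0                # NISQA MOS
MAX_SILENCE_PERCENT = 30.0   # percent of the segment that is silence
MAX_SILENCE_DURATION = 1.2   # assumed to be seconds


def passes_filters(row: Mapping[str, Any]) -> bool:
    """Return True if a metadata row satisfies all documented thresholds."""
    return bool(
        row["total_duration"] >= MIN_DURATION
        and row["mos_pred"] >= MIN_MOS
        and row["is_single_speaker"]
        # Assumption: exceeding either silence limit rejects the segment.
        and row["silence_percent"] <= MAX_SILENCE_PERCENT
        and row["max_silence_duration"] <= MAX_SILENCE_DURATION
    )
```

With a pandas DataFrame `df` of the metadata, `[r for r in df.to_dict("records") if passes_filters(r)]` keeps only the rows that pass; raising `MIN_MOS` gives a stricter subset.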
***
## Data Structure
- **Annotation storage:** Parquet files
- **Speech storage:** .tar.gz files with speech segments in .wav
- **Splitting:** Follows DeepSpeech splits
- **Annotations:** Each sample includes separate fields for:
  - **Filepath**
  - **Quality metrics: MOS, NOI, DIS, COL, LOUD**
  - **Model for quality assessment**
  - **Transcript with stresses and punctuation**
  - **Transcript after ROVER**
  - **Transcript with punctuation**
  - **IPA transcription**
  - **Speaker diarization flag**
  - **Information about silence**
***
## How to Cite
Please cite the following paper if you use this dataset in research:
```
@misc{borodin2025datacentricframeworkaddressingphonetic,
title={A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models},
author={Kirill Borodin and Nikita Vasiliev and Vasiliy Kudryavtsev and Maxim Maslov and Mikhail Gorodnichev and Oleg Rogov and Grach Mkrtchian},
year={2025},
eprint={2507.13563},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.13563},
}
```
***
## Contact
- Telegram: [@korallll_ai](https://t.me/korallll_ai)
- Email: [k.n.borodin@mtuci.ru](mailto:k.n.borodin@mtuci.ru)
***
## Links
- [Balalaika annotation pipeline](https://github.com/mtuciru/balalaika/tree/main/src)
- [Other datasets annotated by BALALAIKA](https://huggingface.co/collections/MTUCI/balalaika-dataset)
- [Custom models' inference implementation](https://huggingface.co/collections/MTUCI/balalaika-models)
- [Paper (arXiv)](https://arxiv.org/pdf/2507.13563)
- [DeepSpeech repository](https://github.com/GeorgeFedoseev/DeepSpeech)
- [NISQA](https://github.com/gabrielmittag/NISQA/tree/master/nisqa)
- [pyannote diarization](https://huggingface.co/pyannote/speaker-diarization-community-1)
- [T-one](https://github.com/voicekit-team/T-one/tree/main)
- [GigaAM v2-rnnt, GigaAMv2-ctc, GigaAMv2-ctc-lm](https://github.com/salute-developers/GigaAM)
- [vosk](https://huggingface.co/alphacep/vosk-model-ru)
- [RuPunct](https://huggingface.co/RUPunct/RUPunct_big)
- [RuAccent](https://github.com/Den4ikAI/ruaccent)
***
## License
Distributed under **MPL 2.0**, matching original DeepSpeech terms.
***
Provided by: lab260



