MTUCI/Balalaika500H

Name: MTUCI/Balalaika500H
Creator: MTUCI
Published: 2025-07-22 19:21:59
License: CC BY-NC-ND 4.0

Source: Hugging Face
Updated: 2025-07-22
Indexed: 2026-01-03

Download link:

https://hf-mirror.com/datasets/MTUCI/Balalaika500H

Description:

---
language:
- ru
license: cc-by-nc-nd-4.0
task_categories:
- text-to-speech
pretty_name: Balalaika
tags:
- russian
- speech-synthesis
- speech-enhancement
- audio
---

# A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

Paper: [https://huggingface.co/papers/2507.13563](https://huggingface.co/papers/2507.13563)

Code: [https://github.com/mtuciru/balalaika](https://github.com/mtuciru/balalaika)

Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks.

---

## Quick Start 👟

```bash
git clone https://github.com/mtuciru/balalaika && cd balalaika
bash create_user_env.sh   # sets up venv + pip deps
bash use_meta_500h.sh     # pick 100h / 500h / 1000h / 2000h as needed
```

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Installation](#installation)
3. [Data Preparation](#data-preparation)
   - [Quick Setup (Default Parameters)](#quick-setup-default-parameters)
   - [Custom Metadata Download](#custom-metadata-download)
4. [Running the Pipeline](#running-the-pipeline)
   - [Basic Scenario (Local Processing)](#basic-scenario-local-processing)
5. [Configuration](#configuration)
6. [Environment Variables](#environment-variables)
7. [Models](#models)
8. [Citation](#citation)
9. [License](#license)

---

## Prerequisites

Ensure you have the following tools installed on your system:

```bash
# ffmpeg: video/audio toolkit
# python3-venv: std-lib virtual-env support
# python3-dev: headers for compiling native wheels
sudo apt update && sudo apt install -y \
    ffmpeg \
    python3 \
    python3-pip \
    python3-venv \
    python3-dev \
    python-is-python3

# Install the uv package manager
wget -qO- https://astral.sh/uv/install.sh | sh
```

---

## Installation

Clone the repository and set up the environment:

```bash
git clone https://github.com/mtuciru/balalaika
cd balalaika

# Use this if you want to annotate/modify the dataset
bash create_dev_env.sh

# Use this if you only want to use the pre-annotated dataset
bash create_user_env.sh
```

---

## Data Preparation

### Quick Setup (Default Parameters)

To download and prepare the dataset with default settings, choose one of the preconfigured dataset sizes:

* **100-hour dataset**

  ```bash
  bash use_meta_100h.sh
  ```

* **500-hour dataset**

  ```bash
  bash use_meta_500h.sh
  ```

* **1000-hour dataset**

  ```bash
  bash use_meta_1000h.sh
  ```

* **2000-hour dataset**

  ```bash
  bash use_meta_2000h.sh
  ```

All metadata can also be downloaded from [Hugging Face – MTUCI](https://huggingface.co/MTUCI).

### Custom Metadata Download

If you already have generated metadata files (`balalaika.parquet` and `balalaika.pkl`), place them in the project root and run:

```bash
bash use_meta.sh
```

---

## Running the Pipeline

### Basic Scenario (Local Processing)

This scenario will:

1. Download datasets
2. Split audio into semantic chunks
3. Transcribe all segments
4. Perform speaker segmentation
5. Apply phonemization

To execute locally, run:

```bash
bash base.sh configs/config.yaml
```

All output metadata will be saved in `podcasts/result.csv`.

---

## Configuration

The main configuration file is located at `configs/config.yaml`. This file is organized into several sections, each corresponding to a specific stage of the podcast processing pipeline.

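Because the notes later in this README require every path in the config to be absolute, a small pre-flight check can catch mistakes before a long run. The following is a minimal sketch, not part of the repository: it assumes the YAML file has already been parsed into a Python dictionary with a top-level `podcasts_path` and one nested mapping per stage. That layout, the helper name `relative_podcast_paths`, and all example values are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def relative_podcast_paths(config: dict) -> list[str]:
    """Return the names of `podcasts_path` entries that are not absolute paths."""
    offenders = []

    # Global (top-level) `podcasts_path`, if present.
    top = config.get("podcasts_path")
    if isinstance(top, str) and not PurePosixPath(top).is_absolute():
        offenders.append("podcasts_path")

    # Per-stage sections (assumed to be nested mappings such as `download`).
    for section, params in config.items():
        if isinstance(params, dict):
            value = params.get("podcasts_path")
            if isinstance(value, str) and not PurePosixPath(value).is_absolute():
                offenders.append(f"{section}.podcasts_path")

    return offenders


# Hypothetical, already-parsed config; the values are made up for this example.
example = {
    "podcasts_path": "/data/podcasts",
    "download": {"podcasts_path": "/data/podcasts", "episodes_limit": 10},
    "preprocess": {"podcasts_path": "podcasts", "duration": 15},
}

print(relative_podcast_paths(example))  # → ['preprocess.podcasts_path']
```
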
Below is a detailed explanation of the key parameters within each section.

---

### Global Parameters

* `podcasts_path`: The **absolute path** to the directory where all downloaded podcast files are stored and where subsequent stages (preprocessing, separation, transcription, etc.) read their input and save their output.

---

### `download` Section

This section controls how podcast episodes are downloaded.

* `podcasts_path`: (As explained above) The directory where downloaded podcasts will be saved.
* `episodes_limit`: Sets a **limit on the number of episodes** to download from a single podcast playlist.
* `num_workers`: Specifies the **number of parallel processes** to use for downloading. A higher number can speed up downloads but will consume more system resources.
* `podcasts_urls_file`: The **path of a `.pkl` file** that contains a list of podcast URLs to be downloaded.

---

### `preprocess` Section

This section handles the initial processing of downloaded audio files, such as chopping them into smaller segments.

* `podcasts_path`: (As explained above) The directory containing the raw downloaded podcasts that need to be preprocessed.
* `duration`: Defines the **maximum length in seconds** for each audio sample (segment).
* `num_workers`: Specifies the **number of parallel processes** to use during preprocessing.
* `whisper_model`: Specifies the **name or path of the Faster-Whisper compatible model** to be used for initial audio processing.
* `compute_type`: Determines the **computation type** for the Whisper model, affecting performance and memory usage.
* `beam_size`: The **beam size** used by the beam search algorithm in the Whisper model's decoding process.

---

### `separation` Section

This section calculates quality metrics for each audio segment.

* `podcasts_path`: (As explained above) The directory where the chopped podcasts (from the `preprocess` stage) are located.
* `num_workers`: The **number of parallel processes** to use for audio separation.
* `nisqa_config`: Specifies the **path to the configuration file for NISQA**.
* `one_speaker`: A **boolean flag** (`True`/`False`) that, when enabled (`True`), instructs the system to download and process only those audio recordings that should contain a single speaker.

---

### `transcription` Section

This section is responsible for converting audio into text.

* `podcasts_path`: (As explained above) The directory containing the processed audio files ready for transcription.
* `model_name`: Specifies the **type of automatic speech recognition (ASR) model** to use. Options typically include `"ctc"` or `"rnnt"`.
* `num_workers`: The **number of parallel processes per GPU** to use for transcription.
* `with_timestamps`: A **boolean flag** (`True`/`False`) that, when enabled, makes the transcription process generate timestamps for each word or segment. **This works only with `ctc`.**
* `lm_path`: Specifies the **path to a language model file (`.bin`)**. A language model can improve transcription accuracy by providing contextual information.

---

### `punctuation` Section

This section focuses on adding proper punctuation to the transcribed text.

* `podcasts_path`: (As explained above) The directory where the transcribed text files are located.
* `model_name`: Specifies the **name of the RUPunct model** to be used for punctuation restoration.
* `num_workers`: The **number of parallel processes per GPU** to use for punctuation.

---

### `accent` Section

This section restores stress marks (accents) in the transcribed text.

* `podcasts_path`: (As explained above) The directory containing the relevant podcast files.
* `num_workers`: The **number of parallel processes per GPU** to use for accent processing.
* `model_name`: Specifies the **name of the ruAccent model** to be used.

---

### `phonemizer` Section

This section is responsible for converting text into phonetic representations (phonemes).

* `podcasts_path`: (As explained above) The directory where the text files (from transcription and punctuation stages) are located.
* `num_workers`: The **number of parallel processes per GPU** to use for phonemization.

---

### `classification` Section

This section relates to global speaker clustering.

* `podcasts_path`: (As explained above) The directory containing the podcast files relevant for classification.
* `num_workers`: The **number of parallel processes per GPU** to use for classification.
* `threshold`: The **speaker classification confidence threshold**. Values typically range from `0.6` to `0.9`. A higher threshold means the model needs to be more confident in its classification to assign a label.
* `model_path`: Specifies the **path to the pretrained speaker classification model** in `.pt` format.

---

### Execution Scripts

Each processing script (`*_yaml.sh` and `*_args.sh`) offers flexibility in how parameters are provided:

* `*_yaml.sh`: These scripts read all necessary parameters directly from the main `config.yaml` file, ensuring consistency across different stages.
* `*_args.sh`: These scripts take arguments hardcoded in the shell script itself, which can be useful for quick tests or specific overrides without modifying the main configuration file.

## Environment Variables

Create a `.env` file in the project root with the following:

```ini
HF_TOKEN=<your_huggingface_token>
YANDEX_KEY=<your_yandex_music_token>
```

* `HF_TOKEN`: Required for speaker count estimation.
* `YANDEX_KEY`: Required for dataset downloads.

---

## Important Notes

- All scripts must be executed from the **project root directory**.
- Paths in the config file must be **absolute**.
- The processing scripts (punctuation, accents) should be run **sequentially**.
- You’ll need:
  - Yandex Music API key ([How to get one](https://yandex-music.readthedocs.io/en/main/token.html))
  - Hugging Face token

## Models

Place all required models under the `models/` directory with the following structure:

```
models/
├── voxblink_resnet/      # Speaker classification model
│   └── ...
└── nisqa_s.tar           # Audio quality assessment model
```

Supported models:

- [NISQA](https://github.com/deepvk/NISQA-s) – Audio quality assessment.
- [GigaAM](https://github.com/salute-developers/GigaAM) – ASR.
- [ruAccent](https://github.com/Den4ikAI/ruaccent) – Accent restoration.
- [RUPunct](https://huggingface.co/RUPunct/RUPunct_big) – Punctuation restoration.
- [VoxBlink ResNet](https://github.com/wenet-e2e/wespeaker) – Speaker classification.
- [TryIPaG2P](https://github.com/NikiPshg/TryIPaG2P) – Phonemization.
- [Speaker Diarization](https://github.com/pyannote/pyannote-audio) – Speaker diarization.
- [Whisper](https://github.com/SYSTRAN/faster-whisper) – ASR + segmentation.

---

## Citation

If you use this pipeline in your research or production, please cite:

```
@misc{borodin2025datacentricframeworkaddressingphonetic,
      title={A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models},
      author={Kirill Borodin and Nikita Vasiliev and Vasiliy Kudryavtsev and Maxim Maslov and Mikhail Gorodnichev and Oleg Rogov and Grach Mkrtchian},
      year={2025},
      eprint={2507.13563},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.13563},
}
```

---

## License

### Dataset: Balalaika

- **CC BY-NC-ND 4.0** – non-commercial, no derivatives, research use only.
- Cite the corpus and do **not** redistribute files without written permission.

### Code

- **CC BY-NC-SA 4.0** – You may use, modify, and share the material for academic, non-commercial purposes only.
- You must retain the copyright and license notices; contact the authors for commercial use.

### Third-Party Models & Libraries

Comply with each component’s original license in addition to the above:

| Component | License |
|-----------|---------|
| NISQA-s | Apache 2.0 |
| GigaAM | MIT |
| ruAccent | CC BY-NC-ND 4.0 |
| RUPunct | CC BY-NC-ND 4.0 |
| VoxBlink ResNet | Apache 2.0 |
| TryIPaG2P | MIT |
| pyannote-audio | MIT |
| Faster-Whisper | MIT |

