MTUCI/Balalaika2000H

Name: MTUCI/Balalaika2000H
Creator: MTUCI
Published: 2025-07-22 19:23:01
License: CC BY-NC-ND 4.0 (per the dataset card)

Source: Hugging Face. Updated 2025-07-22; indexed 2026-01-03.

Download link:

https://hf-mirror.com/datasets/MTUCI/Balalaika2000H


Description:

---
language:
- ru
license: cc-by-nc-nd-4.0
task_categories:
- text-to-speech
pretty_name: Balalaika
tags:
- russian
---

# A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks.

Paper: [A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models](https://huggingface.co/papers/2507.13563)

Code: https://github.com/mtuciru/balalaika

---

## Quick Start 👟

```bash
git clone https://github.com/mtuciru/balalaika && cd balalaika
bash create_user_env.sh   # sets up venv + pip deps
bash use_meta_500h.sh     # pick 100h / 500h / 1000h / 2000h as needed
```

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Installation](#installation)
3. [Data Preparation](#data-preparation)
   - [Quick Setup (Default Parameters)](#quick-setup-default-parameters)
   - [Custom Metadata Download](#custom-metadata-download)
4. [Running the Pipeline](#running-the-pipeline)
   - [Basic Scenario (Local Processing)](#basic-scenario-local-processing)
5. [Configuration](#configuration)
6. [Environment Variables](#environment-variables)
7. [Models](#models)
8. [Citation](#citation)
9. [License](#license)

---

## Prerequisites

Ensure you have the following tools installed on your system:

```bash
# ffmpeg: video/audio toolkit
# python3-pip: pip package manager
# python3-venv: std-lib virtual-env support
# python3-dev: headers for compiling native wheels
sudo apt update && sudo apt install -y \
    ffmpeg \
    python3 \
    python3-pip \
    python3-venv \
    python3-dev \
    python-is-python3

# Install uv
wget -qO- https://astral.sh/uv/install.sh | sh
```

---

## Installation

Clone the repository and set up the environment:

```bash
git clone https://github.com/mtuciru/balalaika
cd balalaika

# Use this if you want to annotate/modify the dataset
bash create_dev_env.sh

# Use this if you only want to use the pre-annotated dataset
bash create_user_env.sh
```

---

## Data Preparation

### Quick Setup (Default Parameters)

To download and prepare the dataset with default settings, choose one of the preconfigured dataset sizes:

* **100-hour dataset**

  ```bash
  bash use_meta_100h.sh
  ```

* **500-hour dataset**

  ```bash
  bash use_meta_500h.sh
  ```

* **1000-hour dataset**

  ```bash
  bash use_meta_1000h.sh
  ```

* **2000-hour dataset**

  ```bash
  bash use_meta_2000h.sh
  ```

All metadata can also be downloaded from [Hugging Face – MTUCI](https://huggingface.co/MTUCI).

### Custom Metadata Download

If you already have generated metadata files (`balalaika.parquet` and `balalaika.pkl`), place them in the project root and run:

```bash
bash use_meta.sh
```

---

## Running the Pipeline

### Basic Scenario (Local Processing)

This scenario will:

1. Download datasets
2. Split audio into semantic chunks
3. Transcribe all segments
4. Perform speaker segmentation
5. Apply phonemization

To execute locally, run:

```bash
bash base.sh configs/config.yaml
```

All output metadata will be saved in `podcasts/result.csv`.

---

## Configuration

The main configuration file is located at `configs/config.yaml`. This file is organized into several sections, each corresponding to a specific stage of the podcast processing pipeline. Below is a detailed explanation of the key parameters within each section.

---

### Global Parameters

* `podcasts_path`: Specifies the **absolute path** to the directory where all downloaded podcast files are stored and where subsequent stages (preprocessing, separation, transcription, etc.) read their input and save their output.

---

### `download` Section

This section controls how podcast episodes are downloaded.

* `podcasts_path`: (As explained above) The directory where downloaded podcasts will be saved.
* `episodes_limit`: Sets a **limit on the number of episodes** to download from a single podcast playlist.
* `num_workers`: Specifies the **number of parallel processes** to use for downloading. A higher number can speed up downloads but will consume more system resources.
* `podcasts_urls_file`: Points to the **path of a `.pkl` file** that contains a list of podcast URLs to be downloaded.

---

### `preprocess` Section

This section handles the initial processing of downloaded audio files, such as chopping them into smaller segments.

* `podcasts_path`: (As explained above) The directory containing the raw downloaded podcasts that need to be preprocessed.
* `duration`: Defines the **maximum length in seconds** for each audio sample (segment).
* `num_workers`: Specifies the **number of parallel processes** to use during preprocessing.
* `whisper_model`: Specifies the **name or path of the Faster-Whisper compatible model** to be used for initial audio processing.
* `compute_type`: Determines the **computation type** for the Whisper model, affecting performance and memory usage.
* `beam_size`: Sets the beam size for the **beam search algorithm** used in the Whisper model's decoding process.

---

### `separation` Section

This section calculates metrics for each audio segment.

* `podcasts_path`: (As explained above) The directory where the chopped podcasts (from the `preprocess` stage) are located.
* `num_workers`: The **number of parallel processes** to use for audio separation.
* `nisqa_config`: Specifies the **path to the configuration file for NISQA**.
* `one_speaker`: A **boolean flag** (`True`/`False`) that, when enabled (`True`), instructs the system to download and process only those audio recordings that are expected to contain a single speaker.

---

### `transcription` Section

This section is responsible for converting audio into text.

* `podcasts_path`: (As explained above) The directory containing the processed audio files ready for transcription.
* `model_name`: Specifies the **type of automatic speech recognition (ASR) model** to use. Options typically include `"ctc"` or `"rnnt"`.
* `num_workers`: The **number of parallel processes per GPU** to use for transcription.
* `with_timestamps`: A **boolean flag** (`True`/`False`) that, when enabled, makes the transcription process generate timestamps for each word or segment. **This only works with `ctc`.**
* `lm_path`: Specifies the **path to a language model file (`.bin`)**. A language model can improve transcription accuracy by providing contextual information.

---

### `punctuation` Section

This section focuses on adding proper punctuation to the transcribed text.

* `podcasts_path`: (As explained above) The directory where the transcribed text files are located.
* `model_name`: Specifies the **name of the RUPunct model** to be used for punctuation restoration.
* `num_workers`: The **number of parallel processes per GPU** to use for punctuation.

---

### `accent` Section

This section restores stress (accent) marks in the transcribed text.

* `podcasts_path`: (As explained above) The directory containing the relevant podcast files.
* `num_workers`: The **number of parallel processes per GPU** to use for accent processing.
* `model_name`: Specifies the **name of the ruAccent model** to be used.

---

### `phonemizer` Section

This section is responsible for converting text into phonetic representations (phonemes).

* `podcasts_path`: (As explained above) The directory where the text files (from transcription and punctuation stages) are located.
* `num_workers`: The **number of parallel processes per GPU** to use for phonemization.

---

### `classification` Section

This section handles global speaker clustering.

* `podcasts_path`: (As explained above) The directory containing the podcast files relevant for classification.
* `num_workers`: The **number of parallel processes per GPU** to use for classification.
* `threshold`: The **speaker classification confidence threshold**. Values typically range from `0.6` to `0.9`. A higher threshold means the model needs to be more confident in its classification to assign a label.
* `model_path`: Specifies the **path to the pretrained speaker classification model** in `.pt` format.

---

### Execution Scripts

Each processing stage comes with two script variants (`*_yaml.sh` and `*_args.sh`), which differ in how parameters are provided:

* `*_yaml.sh`: These scripts read all necessary parameters directly from the main `config.yaml` file, ensuring consistency across different stages.
* `*_args.sh`: These scripts take arguments hardcoded in the shell script itself, which can be useful for quick tests or specific overrides without modifying the main configuration file.

## Environment Variables

Create a `.env` file in the project root with the following:

```ini
HF_TOKEN=<your_huggingface_token>
YANDEX_KEY=<your_yandex_music_token>
```

* `HF_TOKEN`: Required for speaker count estimation.
* `YANDEX_KEY`: Required for dataset downloads.

---

## Important Notes

- All scripts must be executed from the **project root directory**.
- Paths in the config file must be **absolute**.
- The processing scripts (punctuation, accents) should be run **sequentially**.
- You’ll need:
  - Yandex Music API key ([How to get one](https://yandex-music.readthedocs.io/en/main/token.html))
  - Hugging Face token

## Models

Place all required models under the `models/` directory with the following structure:

```
models/
├── voxblink_resnet/   # Speaker classification model
│   └── ...
└── nisqa_s.tar        # Audio quality assessment model
```

Supported models:

- [NISQA](https://github.com/deepvk/NISQA-s) – Audio quality assessment.
- [GigaAM](https://github.com/salute-developers/GigaAM) – ASR.
- [ruAccent](https://github.com/Den4ikAI/ruaccent) – Accent restoration.
- [RUPunct](https://huggingface.co/RUPunct/RUPunct_big) – Punctuation restoration.
- [VoxBlink ResNet](https://github.com/wenet-e2e/wespeaker) – Speaker classification.
- [TryIPaG2P](https://github.com/NikiPshg/TryIPaG2P) – Phonemization.
- [Speaker Diarization](https://github.com/pyannote/pyannote-audio) – Speaker diarization.
- [Whisper](https://github.com/SYSTRAN/faster-whisper) – ASR + segmentation.

---

## Citation

If you use this pipeline in your research or production, please cite:

```
@misc{borodin2025datacentricframeworkaddressingphonetic,
      title={A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models},
      author={Kirill Borodin and Nikita Vasiliev and Vasiliy Kudryavtsev and Maxim Maslov and Mikhail Gorodnichev and Oleg Rogov and Grach Mkrtchian},
      year={2025},
      eprint={2507.13563},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.13563},
}
```

---

## License

### Dataset: Balalaika

- **CC BY-NC-ND 4.0** – non-commercial, no derivatives, research use only.
- Cite the corpus and do **not** redistribute files without written permission.

### Code

- **CC BY-NC-SA 4.0** – You may use, modify, and share the material for academic, non-commercial purposes only.
- You must retain the copyright and license notices; contact the authors for commercial use.

### Third-Party Models & Libraries

Comply with each component’s original license in addition to the above:

| Component | License |
|-----------|---------|
| NISQA-s | Apache 2.0 |
| GigaAM | MIT |
| ruAccent | CC BY-NC-ND 4.0 |
| RUPunct | CC BY-NC-ND 4.0 |
| VoxBlink ResNet | Apache 2.0 |
| TryIPaG2P | MIT |
| pyannote-audio | MIT |
| Faster-Whisper | MIT |
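
## Pre-flight Check (Illustrative)

The README requires a `.env` file containing `HF_TOKEN` and `YANDEX_KEY`, and absolute paths in the config file. The sketch below shows one way to verify both before launching the pipeline. It is not part of the Balalaika repository: the helper names `parse_env`, `missing_keys`, and `is_absolute` are hypothetical, and the parser assumes plain `KEY=VALUE` lines without quoting or variable expansion.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Key names taken from the README's Environment Variables section.
REQUIRED_KEYS = ("HF_TOKEN", "YANDEX_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


def is_absolute(path: str) -> bool:
    """Check for an absolute POSIX path (the README targets apt-based Linux)."""
    return PurePosixPath(path).is_absolute()


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = "HF_TOKEN=hf_example\n# YANDEX_KEY not set yet\n"
    print(missing_keys(parse_env(sample)))
    print(is_absolute("/data/podcasts"), is_absolute("podcasts"))
```

In practice one would read the real `.env` from the project root and pass each path value from `configs/config.yaml` (for example `podcasts_path`) to `is_absolute`; loading the YAML itself is left out here to keep the sketch dependency-free.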

