ESpeech-buldjat
ModelScope community. Updated 2025-12-05; indexed 2025-08-30.
Download link:
https://modelscope.cn/datasets/ESpeech/ESpeech-buldjat
# Buldjat YouTube Audio Dataset
## Dataset Description
This dataset contains 54 hours of processed audio segments extracted from the "Buldjat" YouTube channel, together with the corresponding metadata. Each audio file is a single segment cut from one of the channel's videos and is stored at a 44.1 kHz sample rate.
### Dataset Summary
- **Language**: Russian
- **Task**: Text-to-speech (TTS), automatic speech recognition (ASR), quality assessment
- **Audio format**: MP3, 44.1kHz sample rate
- **Structure**: Segmented audio files with JSON metadata
- **Source**: Buldjat YouTube channel content
## Dataset Structure
### Data Fields
#### Basic Information
- `audio`: Audio data (44.1kHz sample rate, MP3 format)
- `file_name`: Name of the audio segment file (format: `<original_name>_<idx>.mp3`)
- `segment_index`: Index of the audio segment within the original video
- `original_name`: Original name of the YouTube video recording
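
The `file_name` pattern above can be split back into its two parts. The following is a minimal sketch, assuming `<idx>` is a plain decimal integer (possibly zero-padded) that follows the last underscore; the helper name `split_segment_name` and the sample file name are hypothetical and not part of the dataset.

```python
from pathlib import Path


def split_segment_name(file_name: str) -> tuple[str, int]:
    """Split '<original_name>_<idx>.mp3' into (original_name, idx)."""
    stem = Path(file_name).stem  # drop the '.mp3' extension
    # rpartition splits on the LAST underscore, so underscores inside
    # the original video name are preserved.
    original_name, sep, idx = stem.rpartition("_")
    if not sep or not idx.isdigit():
        raise ValueError(f"unexpected segment file name: {file_name!r}")
    return original_name, int(idx)


# Hypothetical file name, for illustration only.
print(split_segment_name("some_video_title_12.mp3"))  # ('some_video_title', 12)
```

Splitting on the last underscore rather than the first keeps video titles that themselves contain underscores intact.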
#### Transcription and Timing
- `text`: Transcribed text of the audio segment
- `start`: Start time of the segment in seconds
- `end`: End time of the segment in seconds
- `words`: Word-level timestamps and confidence scores
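
Since `start` and `end` are both given in seconds, a segment's nominal length is their difference. The sketch below assumes each metadata record is a Python `dict` carrying those two keys; the helper names and the sample records are made up for illustration. If VAD trimming was applied, the length of the audio file itself may differ from this value.

```python
def segment_duration(segment: dict) -> float:
    """Return the nominal segment length in seconds (end - start)."""
    duration = float(segment["end"]) - float(segment["start"])
    if duration < 0:
        raise ValueError("segment ends before it starts")
    return duration


def total_hours(segments: list[dict]) -> float:
    """Sum the nominal durations of all segments and convert to hours."""
    return sum(segment_duration(s) for s in segments) / 3600.0


# Made-up records, not taken from the dataset.
example = [
    {"text": "...", "start": 0.0, "end": 4.5},
    {"text": "...", "start": 4.5, "end": 10.5},
]
print(f"{total_hours(example):.6f} hours")
```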
#### Speaker Information
- `speaker`: Speaker identifier (e.g., "SPEAKER_00")
#### Quality Metrics
- `emos_overall`: EMOS overall quality score
- `noise_confidence`: Noise detection confidence
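
The two quality fields lend themselves to simple filtering, for example when selecting cleaner segments for TTS training. The sketch below rests on two assumptions that this card does not state: that a higher `emos_overall` means better quality, and that a higher `noise_confidence` means noise is more likely. The threshold values and sample records are hypothetical and should be tuned against the actual score distributions.

```python
def keep_clean(segments: list[dict], min_emos: float, max_noise: float) -> list[dict]:
    """Keep segments whose quality score is high enough and noise score low enough.

    Assumes higher emos_overall is better and higher noise_confidence is
    worse; neither direction is confirmed by the dataset card.
    """
    return [
        s
        for s in segments
        if s["emos_overall"] >= min_emos and s["noise_confidence"] <= max_noise
    ]


# Made-up records and thresholds, for illustration only.
records = [
    {"file_name": "a_0.mp3", "emos_overall": 4.2, "noise_confidence": 0.05},
    {"file_name": "a_1.mp3", "emos_overall": 2.1, "noise_confidence": 0.10},
    {"file_name": "a_2.mp3", "emos_overall": 4.5, "noise_confidence": 0.90},
]
clean = keep_clean(records, min_emos=3.5, max_noise=0.5)
print([s["file_name"] for s in clean])  # ['a_0.mp3']
```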

#### Segment Structure
- `num_sentences`: Number of sentences (for merged segments)
- `original_segments`: Original subsegments data (for merged segments)
#### VAD (Voice Activity Detection)
- `vad_trimmed`: Whether VAD trimming was applied
- `vad_start`: VAD start time
- `trim_ratio`: Ratio of trimmed audio
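
Taken together, the fields listed above suggest a record shape along the following lines. This is only a sketch: the Python types are guesses based on the field descriptions, the `Segment` class is hypothetical, and the `audio` payload is left out because it is the decoded audio rather than metadata. Fields that may be absent (for example, those described as applying to merged segments) default to `None`.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class Segment:
    # Basic information
    file_name: str
    segment_index: int
    original_name: str
    # Transcription and timing
    text: str
    start: float
    end: float
    words: Optional[list] = None
    # Speaker information
    speaker: Optional[str] = None
    # Quality metrics
    emos_overall: Optional[float] = None
    noise_confidence: Optional[float] = None
    # Segment structure (described as applying to merged segments)
    num_sentences: Optional[int] = None
    original_segments: Optional[list] = None
    # VAD
    vad_trimmed: Optional[bool] = None
    vad_start: Optional[float] = None
    trim_ratio: Optional[float] = None

    @classmethod
    def from_dict(cls, record: dict[str, Any]) -> "Segment":
        """Build a Segment from a metadata dict, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in record.items() if k in known})
```

Ignoring unknown keys in `from_dict` keeps the loader tolerant of extra metadata that is not described here.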
### Data Splits
- **Train**: All available YouTube video segments
## Dataset Creation
### Source Data
The dataset consists of audio content extracted from the "Buldjat" YouTube channel. The channel produces various types of content primarily in Russian. Each YouTube video has been processed and segmented into multiple audio clips, with each segment saved as a separate MP3 file along with its transcription and metadata.
## Usage
### Loading the Dataset
Extract the tar archive with:
```bash
tar -xf buldjat_stripped_archive.tar
```
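
The same extraction can be done from Python with the standard-library `tarfile` module, which is convenient when the next step is to iterate over the audio files. This is a minimal sketch: the internal layout of the archive is not documented here, so the function simply searches the output directory recursively for `.mp3` files; the function name and the output directory are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, out_dir: str) -> list[Path]:
    """Extract a tar archive and return the .mp3 files found under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        # Only extract archives from a source you trust.
        tar.extractall(out)
    # The archive layout is an unknown here, so search recursively.
    return sorted(out.rglob("*.mp3"))


# Example call (paths are assumptions):
# mp3_files = extract_archive("buldjat_stripped_archive.tar", "buldjat")
# print(len(mp3_files))
```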
### Citation Information
```bibtex
@dataset{buldjat_youtube_audio_dataset,
title={Buldjat YouTube Audio Dataset},
author={Denis Petrov},
year={2025},
url={https://huggingface.co/datasets/ESpeech/ESpeech-buldjat/}
}
```
Provider: maas
Created: 2025-08-28



