khursanirevo/multiturn_ks_embedded
Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/khursanirevo/multiturn_ks_embedded
Resource description:
---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
language:
- en
- ms
- zh
- ru
- id
- ar
- ja
- ko
multilinguality:
- highly_multilingual
size_categories:
- 10K<n<100K
---
# khursanirevo/multiturn_ks_embedded
## Dataset Description
A multi-turn dialogue dataset with **embedded audio** and transcripts in 9 languages, built from 3 YouTube videos.
### Features
- **Audio**: Embedded stereo audio (WAV format, bytes embedded directly in dataset)
- **Segments**: Speaker turn-level annotations with timestamps for English and Malay
- **Multi-language**: Transcripts in 9 languages (en, ms, zh-Hans, zh-Hant, ru, id, ar, ja, ko)
- **Video ID**: YouTube video identifier for each chunk
- **Chunking**: 30-second chunks with 0.5s overlap
- **Self-contained**: No external audio files needed
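The chunking scheme can be illustrated with a minimal sketch. It assumes that consecutive chunks start 29.5 seconds apart (30 s length minus 0.5 s overlap) and that the final chunk is simply shorter; the card does not state either detail, and `chunk_bounds` is a hypothetical helper, not part of the dataset tooling.

```python
def chunk_bounds(total_s, chunk_s=30.0, overlap_s=0.5):
    """Return (start, end) times in seconds for overlapping fixed-length chunks."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap must be non-negative and shorter than the chunk")
    stride = chunk_s - overlap_s  # assumed: each chunk starts 29.5 s after the previous one
    bounds = []
    start = 0.0
    while start < total_s:
        bounds.append((start, min(start + chunk_s, total_s)))
        if start + chunk_s >= total_s:
            break  # the last chunk already reaches the end of the recording
        start += stride
    return bounds


# A 70-second recording yields three chunks, the last one truncated.
for start, end in chunk_bounds(70.0):
    print(f"{start:5.1f}s - {end:5.1f}s")
```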
### Columns
- `audio`: Embedded stereo audio as bytes (WAV, 24kHz)
- `video_id`: YouTube video identifier
- `sentence`: Full transcript for the chunk (English)
- `segments_en`: JSON list of English speaker turns with speaker, start, end, text fields
- `segments_ms`: JSON list of Malay speaker turns with speaker, start, end, text fields
- `total_speakers`: Number of speakers in chunk (typically 2)
- `sentence_ms`, `sentence_en`, etc.: Transcripts in each language
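As an illustration of the segment columns, the sketch below parses a JSON string shaped like a `segments_en` value and sums speaking time per speaker. The sample data is invented, and the speaker labels are assumed to be integers; the real column may encode them differently.

```python
import json
from collections import defaultdict

# Hypothetical value shaped like a `segments_en` cell; texts and times are made up.
raw = json.dumps([
    {"speaker": 0, "start": 0.0, "end": 2.5, "text": "Hello there."},
    {"speaker": 1, "start": 2.5, "end": 4.0, "text": "Hi."},
    {"speaker": 0, "start": 4.0, "end": 7.0, "text": "How are you?"},
])


def talk_time(segments_json):
    """Sum the duration (end - start) of all turns for each speaker."""
    totals = defaultdict(float)
    for seg in json.loads(segments_json):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


print(talk_time(raw))
```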
### Usage
```python
from datasets import load_dataset
import json
import io
import soundfile as sf

# Load dataset. Without `split`, load_dataset returns a DatasetDict, which
# cannot be indexed by row number; "train" is assumed to be the split name.
dataset = load_dataset("khursanirevo/multiturn_ks_embedded", split="train")

# Access a chunk
chunk = dataset[0]

# Load embedded audio (WAV bytes decoded into a NumPy array)
audio_bytes = chunk["audio"]
buffer = io.BytesIO(audio_bytes)
audio, sample_rate = sf.read(buffer)
print(f"Audio shape: {audio.shape}")
print(f"Sample rate: {sample_rate}")
print(f"Duration: {len(audio)/sample_rate:.1f}s")

# Access speaker turns (stored as JSON strings)
video_id = chunk["video_id"]
segments_en = json.loads(chunk["segments_en"])
segments_ms = json.loads(chunk["segments_ms"])
print(f"From video: {video_id}")
print("\nEnglish segments:")
for seg in segments_en[:3]:
    speaker = seg['speaker']
    start = seg['start']
    end = seg['end']
    text = seg['text'][:60]
    print(f" Speaker {speaker} ({start}s-{end}s): {text}...")
```
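Once a chunk is decoded, the segment timestamps can be used to cut out a single speaker turn. The following is a minimal sketch that assumes `start` and `end` are given in seconds relative to the start of the chunk, which the card does not state explicitly. `slice_segment` is a hypothetical helper, and a synthetic array stands in for real audio.

```python
import numpy as np


def slice_segment(audio, sample_rate, start_s, end_s):
    """Return the samples between start_s and end_s (seconds from chunk start)."""
    first = max(0, int(round(start_s * sample_rate)))
    last = min(len(audio), int(round(end_s * sample_rate)))
    return audio[first:last]


# Synthetic stand-in for a decoded chunk: 2 s of stereo silence at 24 kHz.
sample_rate = 24000
audio = np.zeros((2 * sample_rate, 2), dtype=np.float32)

clip = slice_segment(audio, sample_rate, 0.5, 1.25)
print(clip.shape)
```

The end index is clamped to the array length, so a turn that runs past the end of the chunk returns only the samples that exist.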
### Audio Format
Audio is embedded as WAV bytes in the dataset:
- **Format**: WAV (PCM)
- **Sample rate**: 24kHz
- **Channels**: 2 (stereo, one speaker per channel)
- **Bit depth**: 32-bit float
- **Size**: ~2-2.5MB per 30-second chunk
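The raw payload size implied by these parameters can be worked out directly, which gives a quick sanity check against the size of a decoded chunk. The sketch below counts sample data only and ignores the WAV header. The 16-bit line is included purely for comparison, and chunks shorter than 30 seconds are proportionally smaller. `pcm_bytes` is a hypothetical helper.

```python
def pcm_bytes(seconds, sample_rate=24000, channels=2, bytes_per_sample=4):
    """Size in bytes of uncompressed PCM sample data (WAV header excluded)."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


for label, width in (("16-bit", 2), ("32-bit float", 4)):
    size = pcm_bytes(30, bytes_per_sample=width)
    print(f"{label}: {size / 1e6:.2f} MB")
```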
### Speaker Detection
Speakers are detected using RMS energy analysis:
- Channel 0 (left): Speaker 0
- Channel 1 (right): Speaker 1
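The card does not describe how the RMS analysis is applied (window length, thresholds, and so on). As a rough illustration of the idea only, the sketch below computes per-channel RMS energy for a stereo array and reports which channel is louder. The function names and the silence threshold are assumptions, not the dataset's actual pipeline.

```python
import numpy as np


def channel_rms(audio):
    """Root-mean-square energy of each channel in a (samples, channels) array."""
    audio = np.asarray(audio, dtype=np.float64)
    return np.sqrt(np.mean(audio ** 2, axis=0))


def louder_channel(audio, silence_threshold=1e-3):
    """Index of the channel with the highest RMS, or None if all are near-silent."""
    rms = channel_rms(audio)
    if rms.max() < silence_threshold:  # hypothetical threshold
        return None
    return int(np.argmax(rms))


# Synthetic example: a 440 Hz tone on the left channel, silence on the right.
sample_rate = 24000
t = np.arange(sample_rate) / sample_rate
left = 0.5 * np.sin(2 * np.pi * 440 * t)
right = np.zeros_like(t)
stereo = np.stack([left, right], axis=1)

print(channel_rms(stereo))
print(louder_channel(stereo))
```

In practice the same calculation would be run over short windows (for example, the span of a single segment) rather than over a whole chunk.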
### Languages
Supported languages:
- English (en)
- Malay (ms)
- Chinese Simplified (zh-Hans)
- Chinese Traditional (zh-Hant)
- Russian (ru)
- Indonesian (id)
- Arabic (ar)
- Japanese (ja)
- Korean (ko)
### Dataset Statistics
- Total videos: 3
- Total chunks: 496
- Max chunk duration: 30s
- Overlap: 0.5s
- Audio: Embedded (self-contained)
## Source
Created from YouTube videos, with dialogue separation performed by the DialogueSidon model.
## License
CC-BY-4.0
Provider:
khursanirevo



