khursanirevo/multiturn_ks
Hugging Face · Updated 2026-04-09 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/khursanirevo/multiturn_ks
Resource description:
---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
language:
- en
- ms
- zh
- ru
- id
- ar
- ja
- ko
multilinguality:
- highly_multilingual
size_categories:
- 10K<n<100K
---
# khursanirevo/multiturn_ks
## Dataset Description
A multi-turn dialogue dataset with speaker-separated stereo audio and multilingual transcripts, built from 139 YouTube videos.
### Features
- **Audio**: Stereo audio with speaker separation (speaker 0 = left channel, speaker 1 = right channel)
- **Segments**: Speaker turn-level annotations with timestamps for English and Malay
- **Multi-language**: Transcripts in 9 languages (en, ms, zh-Hans, zh-Hant, ru, id, ar, ja, ko)
- **Video ID**: YouTube video identifier for each chunk
- **Chunking**: 30-second chunks with 0.5s overlap
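The chunking scheme can be illustrated with a short sketch. The card does not say exactly how the windows are laid out, so this assumes consecutive chunks share 0.5 s of audio (a stride of 29.5 s) and that the last chunk may be shorter. `chunk_bounds` is a hypothetical helper, not part of the dataset's tooling.

```python
def chunk_bounds(duration_s, chunk_s=30.0, overlap_s=0.5):
    """Return (start, end) times in seconds for overlapping chunks.

    Assumption: consecutive chunks share `overlap_s` seconds, so the
    stride is chunk_s - overlap_s. The final chunk may be shorter.
    """
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be smaller than the chunk length")
    stride = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break  # the recording is fully covered
        start += stride
    return bounds


# A 70-second recording yields three chunks under these assumptions.
print(chunk_bounds(70.0))
```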
### Columns
- `audio`: Playable stereo audio (24kHz)
- `video_id`: YouTube video identifier
- `sentence`: Full transcript for the chunk (English)
- `segments_en`: JSON list of English speaker turns `[{speaker, start, end, text}]`
- `segments_ms`: JSON list of Malay speaker turns `[{speaker, start, end, text}]`
- `total_speakers`: Number of speakers in chunk (typically 2)
- `sentence_ms`, `sentence_en`, etc.: Transcripts in each language
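Because the `segments_*` columns hold JSON strings, they need to be decoded before use. As a minimal sketch, the following sums each speaker's talk time within one chunk. It assumes `start` and `end` are in seconds (the card does not state the unit); `talk_time` and the sample data are made up for illustration.

```python
import json
from collections import defaultdict


def talk_time(segments_json):
    """Sum the duration of each speaker's turns in one chunk."""
    totals = defaultdict(float)
    for seg in json.loads(segments_json):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


# Made-up example in the documented shape [{speaker, start, end, text}].
example = json.dumps([
    {"speaker": 0, "start": 0.0, "end": 2.5, "text": "Hello."},
    {"speaker": 1, "start": 2.5, "end": 4.0, "text": "Hi there."},
    {"speaker": 0, "start": 4.0, "end": 5.0, "text": "How are you?"},
])
print(talk_time(example))
```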
### Usage
```python
from datasets import load_dataset
import json

# load_dataset without a split returns a DatasetDict, which cannot be indexed by row.
# Assumption: the data lives in a "train" split; adjust if the repo uses another name.
dataset = load_dataset("khursanirevo/multiturn_ks", split="train")

# Access audio and segments
chunk = dataset[0]
audio = chunk["audio"]  # Decoded stereo audio
video_id = chunk["video_id"]  # YouTube video ID
segments_en = json.loads(chunk["segments_en"])  # English speaker turns
segments_ms = json.loads(chunk["segments_ms"])  # Malay speaker turns

print(f"From video: {video_id}")
for seg in segments_en:
    speaker = seg['speaker']
    text = seg['text']
    print(f"Speaker {speaker} (EN): {text}")
for seg in segments_ms:
    speaker = seg['speaker']
    text = seg['text']
    print(f"Speaker {speaker} (MS): {text}")
```
### Audio Format
- **Format**: WAV (PCM)
- **Sample rate**: 24kHz
- **Channels**: 2 (stereo, speaker separation)
- **Bit depth**: 32-bit float
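With a 24 kHz sample rate and one speaker per channel, a single turn can be cut out by converting its timestamps to sample indices. The sketch below assumes timestamps are in seconds relative to the start of the chunk and that the audio is available as one sample sequence per channel; `slice_turn` is a hypothetical helper and the signal is synthetic.

```python
SAMPLE_RATE = 24_000  # from the card: 24 kHz


def slice_turn(channels, speaker, start_s, end_s, sample_rate=SAMPLE_RATE):
    """Return the samples of one turn from the speaker's own channel.

    `channels` is a sequence of per-channel sample sequences, with
    channels[0] = speaker 0 (left) and channels[1] = speaker 1 (right).
    """
    first = max(0, round(start_s * sample_rate))
    last = min(len(channels[speaker]), round(end_s * sample_rate))
    return channels[speaker][first:last]


# Tiny synthetic stereo signal: one second per channel.
left = [0.1] * SAMPLE_RATE
right = [-0.1] * SAMPLE_RATE
turn = slice_turn([left, right], speaker=1, start_s=0.25, end_s=0.75)
print(len(turn))
```

If the decoded audio arrives as a single array, it would first need to be split into its two channels in whatever layout the loader uses.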
### Speaker Detection
Speakers are detected using RMS energy analysis and are mapped to channels as follows:
- Channel 0 (left): Speaker 0
- Channel 1 (right): Speaker 1
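The card gives no further detail on the RMS analysis. As a rough sketch of what a per-channel RMS comparison could look like (not the author's actual pipeline), the following picks the louder channel for a window of samples. The function names and the silence threshold are illustrative assumptions.

```python
import math


def rms(samples):
    """Root-mean-square energy of a sequence of samples."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def active_speaker(left, right, silence_threshold=0.01):
    """Return 0 or 1 for the louder channel, or None if both are quiet.

    The threshold is an arbitrary illustrative value, not taken from the card.
    """
    left_rms, right_rms = rms(left), rms(right)
    if max(left_rms, right_rms) < silence_threshold:
        return None
    return 0 if left_rms >= right_rms else 1


# Synthetic window: the left channel is loud, the right channel is silent.
print(active_speaker([0.5, -0.5, 0.5, -0.5], [0.0, 0.0, 0.0, 0.0]))
```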
### Languages
Supported languages:
- English (en)
- Malay (ms)
- Chinese Simplified (zh-Hans)
- Chinese Traditional (zh-Hant)
- Russian (ru)
- Indonesian (id)
- Arabic (ar)
- Japanese (ja)
- Korean (ko)
### Dataset Statistics
- Total videos: 139
- Total chunks: 26383
- Max chunk duration: 30s
- Overlap: 0.5s
## Source
Created from YouTube videos, with dialogue separation performed by the DialogueSidon model.
## License
CC-BY-4.0
Provided by:
khursanirevo



