
khursanirevo/multiturn_ks_embedded

Source: Hugging Face. Updated 2026-04-09. Indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/khursanirevo/multiturn_ks_embedded
Resource description:
---
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
language:
- en
- ms
- zh
- ru
- id
- ar
- ja
- ko
multilinguality:
- highly_multilingual
size_categories:
- 10K<n<100K
---

# khursanirevo/multiturn_ks_embedded

## Dataset Description

Multiturn dialogue dataset with **embedded audio** and multi-language transcripts from 3 YouTube videos.

### Features

- **Audio**: Embedded stereo audio (WAV format, bytes embedded directly in dataset)
- **Segments**: Speaker turn-level annotations with timestamps for English and Malay
- **Multi-language**: Transcripts in 9 languages (en, ms, zh-Hans, zh-Hant, ru, id, ar, ja, ko)
- **Video ID**: YouTube video identifier for each chunk
- **Chunking**: 30-second chunks with 0.5s overlap
- **Self-contained**: No external audio files needed

### Columns

- `audio`: Embedded stereo audio as bytes (WAV, 24kHz)
- `video_id`: YouTube video identifier
- `sentence`: Full transcript for the chunk (English)
- `segments_en`: JSON list of English speaker turns with speaker, start, end, text fields
- `segments_ms`: JSON list of Malay speaker turns with speaker, start, end, text fields
- `total_speakers`: Number of speakers in chunk (typically 2)
- `sentence_ms`, `sentence_en`, etc.: Transcripts in each language

### Usage

```python
from datasets import load_dataset
import json
import io
import soundfile as sf

# Load dataset (assumes the default split is named "train";
# without a split, load_dataset returns a DatasetDict that cannot be indexed by 0)
dataset = load_dataset("khursanirevo/multiturn_ks_embedded", split="train")

# Access a chunk
chunk = dataset[0]

# Load embedded audio (WAV bytes) into an array of shape (samples, channels)
audio_bytes = chunk["audio"]
buffer = io.BytesIO(audio_bytes)
audio, sample_rate = sf.read(buffer)

print(f"Audio shape: {audio.shape}")
print(f"Sample rate: {sample_rate}")
print(f"Duration: {len(audio)/sample_rate:.1f}s")

# Access speaker turns (stored as JSON strings)
video_id = chunk["video_id"]
segments_en = json.loads(chunk["segments_en"])
segments_ms = json.loads(chunk["segments_ms"])

print(f"From video: {video_id}")
print("\nEnglish segments:")
for seg in segments_en[:3]:
    speaker = seg['speaker']
    start = seg['start']
    end = seg['end']
    text = seg['text'][:60]
    print(f"  Speaker {speaker} ({start}s-{end}s): {text}...")
```

### Audio Format

Audio is embedded as WAV bytes in the dataset:

- **Format**: WAV (PCM)
- **Sample rate**: 24kHz
- **Channels**: 2 (stereo, speaker separation)
- **Bit depth**: 32-bit float
- **Size**: ~2-2.5MB per 30-second chunk

### Speaker Detection

Speakers are detected using RMS energy analysis:

- Channel 0 (left): Speaker 0
- Channel 1 (right): Speaker 1

### Languages

Supported languages:

- English (en)
- Malay (ms)
- Chinese Simplified (zh-Hans)
- Chinese Traditional (zh-Hant)
- Russian (ru)
- Indonesian (id)
- Arabic (ar)
- Japanese (ja)
- Korean (ko)

### Dataset Statistics

- Total videos: 3
- Total chunks: 496
- Max chunk duration: 30s
- Overlap: 0.5s
- Audio: Embedded (self-contained)

## Source

Created from YouTube videos with dialogue separation using the DialogueSidon model.

## License

CC-BY-4.0
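
The card states that chunks are at most 30 seconds long with a 0.5s overlap but does not show how the boundaries are computed. The following is a minimal sketch of one plausible scheme, assuming each chunk starts 0.5s before the previous one ends. The function name `chunk_bounds` and its parameters are hypothetical and not part of the dataset's tooling.

```python
def chunk_bounds(total_s, chunk_s=30.0, overlap_s=0.5):
    """Return (start, end) times in seconds for overlapping fixed-length chunks.

    Assumption: consecutive chunks share `overlap_s` seconds, so each chunk
    starts `chunk_s - overlap_s` seconds after the previous one.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be greater than overlap_s")
    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while start < total_s:
        # The last chunk is truncated at the end of the recording.
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break
        start += step
    return bounds


# A 70-second recording yields three chunks, the last one shorter than 30s.
print(chunk_bounds(70.0))  # [(0.0, 30.0), (29.5, 59.5), (59.0, 70.0)]
```

Under this assumption, segment timestamps that fall inside the shared 0.5s may appear in two adjacent chunks, which is worth keeping in mind when concatenating transcripts.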
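
The Speaker Detection section says only that speakers are identified by RMS energy, with channel 0 mapped to Speaker 0 and channel 1 to Speaker 1. As an illustration of that idea, here is a rough sketch that picks the louder channel in a window of stereo samples. The helper names and the `silence_rms` threshold are assumptions made for this example, and the dataset's actual pipeline may differ.

```python
import math


def channel_rms(samples, channel):
    """Root-mean-square level of one channel; `samples` is a sequence of (left, right) rows."""
    values = [row[channel] for row in samples]
    if not values:
        return 0.0
    return math.sqrt(sum(v * v for v in values) / len(values))


def active_speaker(samples, silence_rms=0.01):
    """Return 0 or 1 for the louder channel, or None if both fall below silence_rms.

    The threshold value is an illustrative assumption, not taken from the dataset card.
    """
    left = channel_rms(samples, 0)
    right = channel_rms(samples, 1)
    if max(left, right) < silence_rms:
        return None
    return 0 if left >= right else 1


# Synthetic 0.1s window at 24kHz: a 220Hz tone on the left, near-silence on the right.
window = [(0.5 * math.sin(2 * math.pi * 220 * n / 24000), 0.001) for n in range(2400)]
print(active_speaker(window))  # 0
```

The same functions accept rows from the array returned by `sf.read`, since each row of a stereo array holds one left and one right sample.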
Provider:
khursanirevo