anyreach-ai/dualturn-switchboard-turn-taking

Name: anyreach-ai/dualturn-switchboard-turn-taking
Creator: anyreach-ai
Published: 2026-04-01 09:52:46
License: no description available

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/anyreach-ai/dualturn-switchboard-turn-taking

Description:

---
license: other
task_categories:
- audio-classification
language:
- en
tags:
- turn-taking
- conversation
- speech
- mimi
- vad
pretty_name: Switchboard Turn-Taking
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*.parquet
  - split: test
    path: data/test-*.parquet
---

# Switchboard Turn-Taking

Processed version of the Switchboard corpus with per-frame turn-taking labels and Mimi speech codec features. Each row is one full conversation.

## Splits

| Split | Sessions |
|-------|----------|
| train | 2000 |
| val | 300 |
| test | 138 |

Standard Switchboard split.

## Features

| Column | Shape | dtype | Description |
|--------|-------|-------|-------------|
| `session_id` | scalar | str | Unique session identifier |
| `dataset` | scalar | str | Source corpus name |
| `duration_s` | scalar | float | Conversation duration (seconds) |
| `codes_ch0` | [T, 8] | int | Mimi RVQ codes, speaker 0 |
| `codes_ch1` | [T, 8] | int | Mimi RVQ codes, speaker 1 |
| `mimi_feat_ch0` | [T, 512] | float | Mimi continuous embeddings, speaker 0 |
| `mimi_feat_ch1` | [T, 512] | float | Mimi continuous embeddings, speaker 1 |
| `vad_ch0` | [T] | float | Voice activity (0/1), speaker 0 |
| `vad_ch1` | [T] | float | Voice activity (0/1), speaker 1 |
| `eot_ch0` | [T] | int | End-of-Turn label, speaker 0 |
| `eot_ch1` | [T] | int | End-of-Turn label, speaker 1 |
| `hold_ch0` | [T] | int | Hold (no handover) label, speaker 0 |
| `hold_ch1` | [T] | int | Hold (no handover) label, speaker 1 |
| `bot_ch0` | [T] | int | Beginning-of-Turn label, speaker 0 |
| `bot_ch1` | [T] | int | Beginning-of-Turn label, speaker 1 |
| `bc_ch0` | [T] | int | Backchannel label, speaker 0 |
| `bc_ch1` | [T] | int | Backchannel label, speaker 1 |
| `fvad_ch0` | [T, 4] | float | Fine-grained VAD logits (4 heads), speaker 0 |
| `fvad_ch1` | [T, 4] | float | Fine-grained VAD logits (4 heads), speaker 1 |

`T` is the number of frames in the conversation; the loading examples read it from the `num_frames` field.

**Frame rate:** 12.5 Hz (1 frame = 80 ms).
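With 80 ms frames, a frame index maps to a timestamp by multiplying by 0.08 s. The sketch below is illustrative and not part of the dataset's tooling: the helper names `frames_to_seconds` and `vad_to_segments`, the 0.5 threshold, and the toy array are all assumptions. It converts a per-frame 0/1 voice-activity track such as `vad_ch0` into speech segments expressed in seconds.

```python
FRAME_RATE_HZ = 12.5            # frame rate stated in the dataset card
FRAME_S = 1.0 / FRAME_RATE_HZ   # 0.08 s per frame


def frames_to_seconds(frame_idx):
    """Start time in seconds of the given frame index (hypothetical helper)."""
    return frame_idx * FRAME_S


def vad_to_segments(vad, threshold=0.5):
    """Convert a per-frame 0/1 VAD track into (start_s, end_s) speech segments.

    The end time is exclusive: it is the start time of the first
    non-speech frame after the run. The threshold is an assumption.
    """
    segments = []
    start = None  # index of the first frame of the current speech run
    for i, value in enumerate(vad):
        active = value > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((frames_to_seconds(start), frames_to_seconds(i)))
            start = None
    if start is not None:  # speech runs until the last frame
        segments.append((frames_to_seconds(start), frames_to_seconds(len(vad))))
    return segments


# Toy track standing in for a real `vad_ch0` column.
toy_vad = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
segments = vad_to_segments(toy_vad)
print([(round(a, 2), round(b, 2)) for a, b in segments])  # [(0.08, 0.32), (0.48, 0.64)]
```

The same function accepts a list or a 1D NumPy array, so it can be applied directly to a loaded `vad_ch0` or `vad_ch1` column.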
Event labels (eot, hold, bot, bc) are sparse binary: 0 everywhere except at event frames.

## Splits file

`splits.json` in the repo root maps every session ID to its split. Useful for reproducing the split or processing the raw audio yourself:

```python
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "anyreach-ai/dualturn-switchboard-turn-taking",
    "splits.json",
    repo_type="dataset",
)
with open(path) as f:
    splits = json.load(f)

print(splits["split_counts"])  # per-split session counts
```

## Loading

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("anyreach-ai/dualturn-switchboard-turn-taking")
session = ds["val"][0]
T = session["num_frames"]  # number of 80 ms frames in this session

# 2D arrays are stored flat; reshape to recover the original shape
codes = np.array(session["codes_ch0"]).reshape(T, 8)        # (T, 8) int
feats = np.array(session["mimi_feat_ch0"]).reshape(T, 512)  # (T, 512) float
fvad = np.array(session["fvad_ch0"]).reshape(T, 4)          # (T, 4) float

# 1D arrays can be used directly
vad = np.array(session["vad_ch0"])  # (T,) float
eot = np.array(session["eot_ch0"])  # (T,) int
```

## PyTorch windowed loader

```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset

LABEL_KEYS = ["eot", "hold", "bot", "bc"]
CHANNELS = ["ch0", "ch1"]


def collate_windows(sessions, window_frames=125, hop_frames=25):
    """Slice each session into fixed-length windows and collate into a batch."""
    windows = []
    for s in sessions:
        T = s["num_frames"]
        # Convert each column to a tensor once per session, then slice per window.
        arrays = {}
        for ch in CHANNELS:
            arrays[f"codes_{ch}"] = torch.tensor(
                np.array(s[f"codes_{ch}"]).reshape(T, 8), dtype=torch.long
            )
            arrays[f"vad_{ch}"] = torch.tensor(np.array(s[f"vad_{ch}"]), dtype=torch.float)
            for name in LABEL_KEYS:
                key = f"{name}_{ch}"
                arrays[key] = torch.tensor(np.array(s[key]), dtype=torch.float)
        # Sessions shorter than window_frames contribute no windows.
        for start in range(0, T - window_frames + 1, hop_frames):
            end = start + window_frames
            windows.append({k: v[start:end] for k, v in arrays.items()})
    return {k: torch.stack([w[k] for w in windows]) for k in windows[0]}


ds = load_dataset("anyreach-ai/dualturn-switchboard-turn-taking")

# At 12.5 Hz, 125 frames = 10 s per window and 25 frames = 2 s hop.
loader = DataLoader(
    ds["train"],
    batch_size=8,  # sessions per batch; each session yields many windows
    shuffle=True,
    collate_fn=lambda b: collate_windows(b, window_frames=125, hop_frames=25),
)

batch = next(iter(loader))
print(batch["codes_ch0"].shape)  # [N_windows, 125, 8]
print(batch["eot_ch0"].shape)    # [N_windows, 125]
```

## Label definitions

| Label | Meaning |
|-------|---------|
| **EOT** | End-of-Turn: speaker yields the floor |
| **HOLD** | Speaker keeps the floor (no handover) |
| **BOT** | Beginning-of-Turn: other speaker takes the floor |
| **BC** | Backchannel: short acknowledgement, no floor claim |
| **VAD** | Voice Activity Detection (1 = speech) |

## DualTurn Model & Code

The following will be released soon:

- **Trained model checkpoint**: on HuggingFace at [anyreach-ai](https://huggingface.co/anyreach-ai)
- **Training code**: model architecture, training loop, and configs
- **Evaluation code**: benchmarks and metrics used in the paper

## Authors

- [Shangeth Rajaa](https://github.com/shangeth), Senior ML Research Scientist, Anyreach AI

## Citation

This dataset was used for training and evaluation in the **DualTurn** paper (submitted to Interspeech 2026). `splits.json` contains the exact train/val/test splits from the official dataset used for all experiments in the paper.

**Paper:** [DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining](https://arxiv.org/abs/2603.08216)

```bibtex
@misc{rajaa2026dualturnlearningturntakingdualchannel,
  title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
  author={Shangeth Rajaa},
  year={2026},
  eprint={2603.08216},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2603.08216},
}
```

If you use this dataset, please cite:

- [cgpotts/swda](https://huggingface.co/datasets/cgpotts/swda)

Provider:

anyreach-ai
