anyreach-ai/dualturn-switchboard-turn-taking
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/anyreach-ai/dualturn-switchboard-turn-taking
Description:
---
license: other
task_categories:
- audio-classification
language:
- en
tags:
- turn-taking
- conversation
- speech
- mimi
- vad
pretty_name: Switchboard Turn-Taking
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*.parquet
  - split: val
    path: data/val-*.parquet
  - split: test
    path: data/test-*.parquet
---
# Switchboard Turn-Taking
Processed version of the Switchboard corpus with per-frame turn-taking labels and Mimi speech codec features. Each row is one full conversation.
## Splits
| Split | Sessions |
|-------|----------|
| train | 2000 |
| val | 300 |
| test | 138 |
The partition follows the standard Switchboard split; `splits.json` records the split of every session.
## Features
| Column | Shape | dtype | Description |
|--------|-------|-------|-------------|
| `session_id` | — | str | Unique session identifier |
| `dataset` | — | str | Source corpus name |
| `duration_s` | — | float | Conversation duration (seconds) |
| `num_frames` | — | int | Number of frames T in the conversation |
| `codes_ch0` | [T, 8] | int | Mimi RVQ codes, speaker 0 |
| `codes_ch1` | [T, 8] | int | Mimi RVQ codes, speaker 1 |
| `mimi_feat_ch0` | [T, 512] | float | Mimi continuous embeddings, speaker 0 |
| `mimi_feat_ch1` | [T, 512] | float | Mimi continuous embeddings, speaker 1 |
| `vad_ch0` | [T] | float | Voice activity (0/1), speaker 0 |
| `vad_ch1` | [T] | float | Voice activity (0/1), speaker 1 |
| `eot_ch0` | [T] | int | End-of-Turn label, speaker 0 |
| `eot_ch1` | [T] | int | End-of-Turn label, speaker 1 |
| `hold_ch0` | [T] | int | Hold (no handover) label, speaker 0 |
| `hold_ch1` | [T] | int | Hold (no handover) label, speaker 1 |
| `bot_ch0` | [T] | int | Beginning-of-Turn label, speaker 0 |
| `bot_ch1` | [T] | int | Beginning-of-Turn label, speaker 1 |
| `bc_ch0` | [T] | int | Backchannel label, speaker 0 |
| `bc_ch1` | [T] | int | Backchannel label, speaker 1 |
| `fvad_ch0` | [T, 4] | float | Fine-grained VAD logits (4 heads), speaker 0 |
| `fvad_ch1` | [T, 4] | float | Fine-grained VAD logits (4 heads), speaker 1 |
**Frame rate:** 12.5 Hz — 1 frame = 80 ms.
Event labels (eot, hold, bot, bc) are sparse binary: 0 everywhere except at event frames.
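Because each frame spans 80 ms, a frame index converts to a timestamp by dividing by the frame rate. The sketch below recovers event timestamps from a sparse binary label track. It uses a toy label array rather than real data, and `event_times` is a hypothetical helper name, not part of any dataset tooling.

```python
import numpy as np

FRAME_RATE_HZ = 12.5  # from the dataset card: 1 frame = 80 ms


def event_times(labels, frame_rate_hz=FRAME_RATE_HZ):
    """Return the start time in seconds of every frame whose label is nonzero.

    Illustrative helper only; not part of the dataset tooling.
    """
    frames = np.flatnonzero(np.asarray(labels))
    return frames / frame_rate_hz


# Toy sparse label track with events at frames 2 and 5.
toy_eot = [0, 0, 1, 0, 0, 1, 0]
times = event_times(toy_eot)
print(times)  # frame 2 -> 0.16 s, frame 5 -> 0.40 s
```

The same call should work on any of the `eot`, `hold`, `bot`, or `bc` columns, since they share the sparse binary layout.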
## Splits file
`splits.json` in the repo root maps every session ID to its split. Useful for
reproducing the split or processing the raw audio yourself:
```python
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    "anyreach-ai/dualturn-switchboard-turn-taking",
    "splits.json",
    repo_type="dataset",
)
with open(path) as f:
    splits = json.load(f)

# Prints a dict of per-split session counts keyed by 'train', 'val', 'test'.
print(splits["split_counts"])
```
## Loading
```python
import numpy as np
from datasets import load_dataset
ds = load_dataset("anyreach-ai/dualturn-switchboard-turn-taking")
session = ds["val"][0]
T = session["num_frames"]
# 2D arrays are stored flat — reshape to recover original shape
codes = np.array(session["codes_ch0"]).reshape(T, 8) # (T, 8) int
feats = np.array(session["mimi_feat_ch0"]).reshape(T, 512) # (T, 512) float
fvad = np.array(session["fvad_ch0"]).reshape(T, 4) # (T, 4) float
# 1D arrays — use directly
vad = np.array(session["vad_ch0"]) # (T,) float
eot = np.array(session["eot_ch0"]) # (T,) int
```
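The reshape calls above only succeed when the flat length equals `T` times the feature width. A minimal sketch of a checked reshape follows. `unflatten` is a hypothetical helper, and the toy dict stands in for a real dataset row.

```python
import numpy as np


def unflatten(flat, num_frames, width):
    """Reshape a flat column to (num_frames, width), failing loudly on a length mismatch."""
    arr = np.asarray(flat)
    if arr.size != num_frames * width:
        raise ValueError(f"expected {num_frames * width} values, got {arr.size}")
    return arr.reshape(num_frames, width)


# Toy stand-in for a dataset row: 3 frames of 8 codebook indices, stored flat.
toy_session = {"num_frames": 3, "codes_ch0": list(range(24))}
codes = unflatten(toy_session["codes_ch0"], toy_session["num_frames"], 8)
print(codes.shape)  # (3, 8)
```

An explicit check like this gives a clearer error than a bare `reshape` when `num_frames` and a column disagree.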
## PyTorch windowed loader
```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset

LABEL_KEYS = ["eot", "hold", "bot", "bc"]
CHANNELS = ["ch0", "ch1"]


def collate_windows(sessions, window_frames=125, hop_frames=25):
    """Slice each session into fixed-length windows and collate into a batch."""
    windows = []
    for s in sessions:
        T = s["num_frames"]
        # Convert each column once per session instead of once per window.
        arrays = {}
        for ch in CHANNELS:
            # Codes are stored flat; reshape to (T, 8).
            arrays[f"codes_{ch}"] = torch.tensor(
                np.array(s[f"codes_{ch}"]).reshape(T, 8), dtype=torch.long
            )
            arrays[f"vad_{ch}"] = torch.tensor(np.array(s[f"vad_{ch}"]), dtype=torch.float)
            for name in LABEL_KEYS:
                key = f"{name}_{ch}"
                arrays[key] = torch.tensor(np.array(s[key]), dtype=torch.float)
        # Sessions shorter than window_frames contribute no windows.
        for start in range(0, T - window_frames + 1, hop_frames):
            end = start + window_frames
            windows.append({k: v[start:end] for k, v in arrays.items()})
    if not windows:
        raise ValueError("no session in the batch is at least window_frames long")
    return {k: torch.stack([w[k] for w in windows]) for k in windows[0]}


ds = load_dataset("anyreach-ai/dualturn-switchboard-turn-taking")
# batch_size counts sessions; each session expands into many windows.
loader = DataLoader(ds["train"], batch_size=8, shuffle=True,
                    collate_fn=lambda b: collate_windows(b, window_frames=125, hop_frames=25))
batch = next(iter(loader))
print(batch["codes_ch0"].shape)  # [N_windows, 125, 8]
print(batch["eot_ch0"].shape)    # [N_windows, 125]
```
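With the default `window_frames=125` and `hop_frames=25`, each window covers 10 s of audio at 12.5 Hz, and consecutive windows start 2 s apart. The number of windows a session contributes follows from the `range` call in `collate_windows`. The helper `num_windows` below is a hypothetical name used only for illustration.

```python
def num_windows(num_frames, window_frames=125, hop_frames=25):
    """Count the windows produced by range(0, num_frames - window_frames + 1, hop_frames)."""
    if num_frames < window_frames:
        return 0  # session too short for even one window
    return (num_frames - window_frames) // hop_frames + 1


print(num_windows(100))   # 0
print(num_windows(125))   # 1
print(num_windows(3750))  # a 5-minute session (3750 frames): 146
```

`N_windows` in the batch shapes above is the sum of this count over the sessions in the batch, so it varies from batch to batch.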
## Label definitions
| Label | Meaning |
|-------|---------|
| **EOT** | End-of-Turn: speaker yields the floor |
| **HOLD** | Speaker keeps the floor (no handover) |
| **BOT** | Beginning-of-Turn: other speaker takes the floor |
| **BC** | Backchannel: short acknowledgement, no floor claim |
| **VAD** | Voice Activity Detection (1 = speech) |
## DualTurn Model & Code
The following will be released soon:
- **Trained model checkpoint** — on HuggingFace at [anyreach-ai](https://huggingface.co/anyreach-ai)
- **Training code** — model architecture, training loop, and configs
- **Evaluation code** — benchmarks and metrics used in the paper
## Authors
- [Shangeth Rajaa](https://github.com/shangeth) — Senior ML Research Scientist, Anyreach AI
## Citation
This dataset was used for training and evaluation in the **DualTurn** paper (submitted to Interspeech 2026).
`splits.json` contains the exact train/val/test splits from the official dataset used for all experiments in the paper.
**Paper:** [DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining](https://arxiv.org/abs/2603.08216)
```bibtex
@misc{rajaa2026dualturnlearningturntakingdualchannel,
title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
author={Shangeth Rajaa},
year={2026},
eprint={2603.08216},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2603.08216},
}
```
If you use this dataset, please cite [cgpotts/swda](https://huggingface.co/datasets/cgpotts/swda).
Provider: anyreach-ai



