anyreach-ai/dualturn-otospeech-turn-taking
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/anyreach-ai/dualturn-otospeech-turn-taking
---
license: other
task_categories:
- audio-classification
language:
- en
tags:
- turn-taking
- conversation
- speech
- mimi
- vad
pretty_name: OtoSpeech Turn-Taking
---
# OtoSpeech Turn-Taking
Processed version of the OtoSpeech corpus with per-frame turn-taking labels and Mimi speech codec features. Each row is one full conversation.
## Splits
| Split | Sessions |
|-------|----------|
| train | 900 |
| val | 112 |
| test | 113 |
80/10/10 split, seed=42.
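The card does not describe how the 80/10/10 partition was drawn, so `splits.json` remains the authoritative assignment. As a rough illustration only, a seeded shuffle with floor-rounded cut points reproduces the counts in the table above. The function name `split_sessions` and the placeholder IDs are hypothetical, and the resulting membership will not match the official split.

```python
import random


def split_sessions(session_ids, seed=42, train_pct=80, val_pct=10):
    """Seeded shuffle followed by floor-rounded cuts (illustrative, not the authors' procedure)."""
    ids = sorted(session_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * train_pct // 100
    n_val = len(ids) * val_pct // 100
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder goes to test
    }


# Placeholder IDs; real session IDs come from the dataset itself.
fake_ids = [f"session_{i:04d}" for i in range(1125)]
parts = split_sessions(fake_ids)
counts = {name: len(ids) for name, ids in parts.items()}
print(counts)  # {'train': 900, 'val': 112, 'test': 113}
```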
## Features
| Column | Shape | dtype | Description |
|--------|-------|-------|-------------|
| `session_id` | — | str | Unique session identifier |
| `dataset` | — | str | Source corpus name |
| `duration_s` | — | float | Conversation duration (seconds) |
| `codes_ch0` | [T, 8] | int | Mimi RVQ codes, speaker 0 |
| `codes_ch1` | [T, 8] | int | Mimi RVQ codes, speaker 1 |
| `mimi_feat_ch0` | [T, 512] | float | Mimi continuous embeddings, speaker 0 |
| `mimi_feat_ch1` | [T, 512] | float | Mimi continuous embeddings, speaker 1 |
| `vad_ch0` | [T] | float | Voice activity (0/1), speaker 0 |
| `vad_ch1` | [T] | float | Voice activity (0/1), speaker 1 |
| `eot_ch0` | [T] | int | End-of-Turn label, speaker 0 |
| `eot_ch1` | [T] | int | End-of-Turn label, speaker 1 |
| `hold_ch0` | [T] | int | Hold (no handover) label, speaker 0 |
| `hold_ch1` | [T] | int | Hold (no handover) label, speaker 1 |
| `bot_ch0` | [T] | int | Beginning-of-Turn label, speaker 0 |
| `bot_ch1` | [T] | int | Beginning-of-Turn label, speaker 1 |
| `bc_ch0` | [T] | int | Backchannel label, speaker 0 |
| `bc_ch1` | [T] | int | Backchannel label, speaker 1 |
| `fvad_ch0` | [T, 4] | float | Fine-grained VAD logits (4 heads), speaker 0 |
| `fvad_ch1` | [T, 4] | float | Fine-grained VAD logits (4 heads), speaker 1 |
**Frame rate:** 12.5 Hz — 1 frame = 80 ms.
Event labels (eot, hold, bot, bc) are sparse binary: 0 everywhere except at event frames.
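Because the event labels are sparse and the frame rate is fixed, a label track can be turned into event timestamps directly. The sketch below uses a synthetic label array; `FRAME_SEC` and `event_times` are illustrative names, and mapping a frame index to the start of its 80 ms frame is an assumption, since the card does not say whether an event marks the frame start or centre.

```python
import numpy as np

FRAME_SEC = 0.08  # 12.5 Hz frame rate -> 80 ms per frame


def event_times(label_track):
    """Return the start time in seconds of every frame where the sparse label is non-zero."""
    frames = np.flatnonzero(np.asarray(label_track))
    return frames * FRAME_SEC


# Synthetic track for illustration: events at frames 3 and 10.
eot = np.zeros(25, dtype=int)
eot[[3, 10]] = 1
times = event_times(eot)
print(times)
```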
## Splits file
`splits.json` in the repo root maps every session ID to its split. Useful for
reproducing the split or processing the raw audio yourself:
```python
from huggingface_hub import hf_hub_download
import json
path = hf_hub_download("anyreach-ai/dualturn-otospeech-turn-taking", "splits.json", repo_type="dataset")
with open(path) as f:
    splits = json.load(f)
print(splits["split_counts"])
# e.g. {'train': 900, 'val': 112, 'test': 113}
```
## Loading
```python
import numpy as np
from datasets import load_dataset
ds = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
session = ds["val"][0]
T = session["num_frames"]  # number of frames T; this field is not listed in the Features table above
# 2D arrays are stored flat — reshape to recover original shape
codes = np.array(session["codes_ch0"]).reshape(T, 8) # (T, 8) int
feats = np.array(session["mimi_feat_ch0"]).reshape(T, 512) # (T, 512) float
fvad = np.array(session["fvad_ch0"]).reshape(T, 4) # (T, 4) float
# 1D arrays — use directly
vad = np.array(session["vad_ch0"]) # (T,) float
eot = np.array(session["eot_ch0"]) # (T,) int
```
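Since every 2D column needs the same reshape, it can be convenient to wrap it in a helper. The sketch below is one possible way to do that. `FEATURE_WIDTHS` and `unflatten_session` are hypothetical names, the widths are taken from the Features table, and a synthetic session stands in for a real row so the snippet runs offline.

```python
import numpy as np

# Second dimension of each 2D column, as listed in the Features table.
FEATURE_WIDTHS = {"codes": 8, "mimi_feat": 512, "fvad": 4}


def unflatten_session(session):
    """Return a dict of NumPy arrays with the 2D columns reshaped to (T, width)."""
    T = session["num_frames"]
    out = {}
    for key, value in session.items():
        prefix = key.rsplit("_ch", 1)[0]  # e.g. "codes_ch0" -> "codes"
        if prefix in FEATURE_WIDTHS:
            out[key] = np.asarray(value).reshape(T, FEATURE_WIDTHS[prefix])
        elif isinstance(value, (list, np.ndarray)):
            out[key] = np.asarray(value)  # 1D per-frame columns are used as-is
        else:
            out[key] = value  # scalars such as session_id or duration_s
    return out


# Synthetic stand-in for one dataset row with T = 5 frames.
T = 5
fake = {
    "session_id": "demo",
    "num_frames": T,
    "codes_ch0": list(range(T * 8)),
    "fvad_ch0": [0.0] * (T * 4),
    "vad_ch0": [0.0, 1.0, 1.0, 0.0, 0.0],
}
arrays = unflatten_session(fake)
print(arrays["codes_ch0"].shape, arrays["fvad_ch0"].shape, arrays["vad_ch0"].shape)
# (5, 8) (5, 4) (5,)
```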
## PyTorch windowed loader
```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
LABEL_KEYS = ["eot", "hold", "bot", "bc"]

def collate_windows(sessions, window_frames=125, hop_frames=25):
    """Slice each session into fixed-length windows and collate into a batch."""
    windows = []
    for s in sessions:
        T = s["num_frames"]
        # Reshape the flat code arrays once per session rather than once per window.
        codes = {ch: np.array(s[f"codes_{ch}"]).reshape(T, 8) for ch in ("ch0", "ch1")}
        # Sessions shorter than window_frames contribute no windows.
        for start in range(0, T - window_frames + 1, hop_frames):
            end = start + window_frames
            w = {
                "codes_ch0": torch.tensor(codes["ch0"][start:end], dtype=torch.long),
                "codes_ch1": torch.tensor(codes["ch1"][start:end], dtype=torch.long),
                "vad_ch0": torch.tensor(np.array(s["vad_ch0"])[start:end], dtype=torch.float),
                "vad_ch1": torch.tensor(np.array(s["vad_ch1"])[start:end], dtype=torch.float),
            }
            for name in LABEL_KEYS:
                for ch in ["ch0", "ch1"]:
                    key = f"{name}_{ch}"
                    w[key] = torch.tensor(np.array(s[key])[start:end], dtype=torch.float)
            windows.append(w)
    if not windows:
        # Avoid an opaque IndexError when every session in the batch is too short.
        raise ValueError("no session in this batch has at least window_frames frames")
    return {k: torch.stack([w[k] for w in windows]) for k in windows[0]}

ds = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
loader = DataLoader(ds["train"], batch_size=8, shuffle=True,
collate_fn=lambda b: collate_windows(b, window_frames=125, hop_frames=25))
batch = next(iter(loader))
print(batch["codes_ch0"].shape) # [N_windows, 125, 8]
print(batch["eot_ch0"].shape) # [N_windows, 125]
```
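When choosing a batch size it helps to know how many windows one session contributes. With the loop bounds used in `collate_windows`, a session of T frames yields floor((T - W) / H) + 1 windows when T >= W and none otherwise. At 12.5 Hz the defaults (W = 125, H = 25) correspond to 10 s windows with a 2 s hop. A small check, with `count_windows` as an illustrative helper name:

```python
def count_windows(num_frames, window_frames=125, hop_frames=25):
    """Number of windows produced by range(0, T - W + 1, H)."""
    if num_frames < window_frames:
        return 0
    return (num_frames - window_frames) // hop_frames + 1


# A 60 s session has 60 * 12.5 = 750 frames.
n = count_windows(750)
print(n)  # 26

# The closed form agrees with the loop bounds used by the collate function.
assert n == len(range(0, 750 - 125 + 1, 25))
```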
## Label definitions
| Label | Meaning |
|-------|---------|
| **EOT** | End-of-Turn: speaker yields the floor |
| **HOLD** | Speaker keeps the floor (no handover) |
| **BOT** | Beginning-of-Turn: other speaker takes the floor |
| **BC** | Backchannel: short acknowledgement, no floor claim |
| **VAD** | Voice Activity Detection (1 = speech) |
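Because the event labels are 1 only at event frames, positive frames are rare relative to negatives. One common way to account for this during training, which is not necessarily what the DualTurn paper does, is to weight positives by the negative-to-positive ratio, for example through the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`. The sketch below computes that ratio on a synthetic track; the helper name `pos_weight` is illustrative.

```python
import numpy as np


def pos_weight(label_track):
    """Ratio of negative to positive frames; None when the track has no positives."""
    labels = np.asarray(label_track)
    n_pos = int(labels.sum())
    if n_pos == 0:
        return None
    return (labels.size - n_pos) / n_pos


# Synthetic track: 2 event frames out of 100.
bc = np.zeros(100, dtype=int)
bc[[10, 60]] = 1
weight = pos_weight(bc)
print(weight)  # 49.0
```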
## DualTurn Model & Code
The following will be released soon:
- **Trained model checkpoint** — on HuggingFace at [anyreach-ai](https://huggingface.co/anyreach-ai)
- **Training code** — model architecture, training loop, and configs
- **Evaluation code** — benchmarks and metrics used in the paper
## Authors
- [Shangeth Rajaa](https://github.com/shangeth) — Senior ML Research Scientist, Anyreach AI
## Citation
This dataset was used for training and evaluation in the **DualTurn** paper (submitted to Interspeech 2026).
`splits.json` contains the exact train/val/test splits from the official dataset used for all experiments in the paper.
**Paper:** [DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining](https://arxiv.org/abs/2603.08216)
```bibtex
@misc{rajaa2026dualturnlearningturntakingdualchannel,
title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
author={Shangeth Rajaa},
year={2026},
eprint={2603.08216},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2603.08216},
}
```
If you use this dataset, please cite the original OtoSpeech dataset, [otoearth/otoSpeech-full-duplex-280h](https://huggingface.co/datasets/otoearth/otoSpeech-full-duplex-280h):
```bibtex
@misc{otoSpeech-full-duplex-280h,
title = {otoSpeech-full-duplex-280h: Full-Duplex Conversational Speech Dataset},
author = {otoearth},
year = {2025},
howpublished = {\url{https://huggingface.co/datasets/otoearth/otoSpeech-full-duplex-280h}},
note = {License: CC BY 4.0}
}
```
Provider:
anyreach-ai



