anyreach-ai/dualturn-otospeech-turn-taking

Name: anyreach-ai/dualturn-otospeech-turn-taking
Creator: anyreach-ai
Published: 2026-04-01 09:48:09
License: no description available

Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/anyreach-ai/dualturn-otospeech-turn-taking


Dataset description:

---
license: other
task_categories:
- audio-classification
language:
- en
tags:
- turn-taking
- conversation
- speech
- mimi
- vad
pretty_name: OtoSpeech Turn-Taking
---

# OtoSpeech Turn-Taking

Processed version of the OtoSpeech corpus with per-frame turn-taking labels and Mimi speech codec features. Each row is one full conversation.

## Splits

| Split | Sessions |
|-------|----------|
| train | 900 |
| val | 112 |
| test | 113 |

80/10/10 split, seed=42.

## Features

| Column | Shape | dtype | Description |
|--------|-------|-------|-------------|
| `session_id` | scalar | str | Unique session identifier |
| `dataset` | scalar | str | Source corpus name |
| `duration_s` | scalar | float | Conversation duration (seconds) |
| `codes_ch0` | [T, 8] | int | Mimi RVQ codes, speaker 0 |
| `codes_ch1` | [T, 8] | int | Mimi RVQ codes, speaker 1 |
| `mimi_feat_ch0` | [T, 512] | float | Mimi continuous embeddings, speaker 0 |
| `mimi_feat_ch1` | [T, 512] | float | Mimi continuous embeddings, speaker 1 |
| `vad_ch0` | [T] | float | Voice activity (0/1), speaker 0 |
| `vad_ch1` | [T] | float | Voice activity (0/1), speaker 1 |
| `eot_ch0` | [T] | int | End-of-Turn label, speaker 0 |
| `eot_ch1` | [T] | int | End-of-Turn label, speaker 1 |
| `hold_ch0` | [T] | int | Hold (no handover) label, speaker 0 |
| `hold_ch1` | [T] | int | Hold (no handover) label, speaker 1 |
| `bot_ch0` | [T] | int | Beginning-of-Turn label, speaker 0 |
| `bot_ch1` | [T] | int | Beginning-of-Turn label, speaker 1 |
| `bc_ch0` | [T] | int | Backchannel label, speaker 0 |
| `bc_ch1` | [T] | int | Backchannel label, speaker 1 |
| `fvad_ch0` | [T, 4] | float | Fine-grained VAD logits (4 heads), speaker 0 |
| `fvad_ch1` | [T, 4] | float | Fine-grained VAD logits (4 heads), speaker 1 |

**Frame rate:** 12.5 Hz, so 1 frame = 80 ms. `T` is the number of frames in the session.

Event labels (eot, hold, bot, bc) are sparse binary: 0 everywhere except at event frames.

The loading examples below also read a `num_frames` field (the frame count `T`), which is not listed in the table above.

## Splits file

`splits.json` in the repo root maps every session ID to its split. It is useful for reproducing the split or for processing the raw audio yourself:

```python
from huggingface_hub import hf_hub_download
import json

# Download splits.json from the dataset repository
path = hf_hub_download("anyreach-ai/dualturn-otospeech-turn-taking", "splits.json", repo_type="dataset")
with open(path) as f:
    splits = json.load(f)

print(splits["split_counts"])  # e.g. {'train': 900, 'val': 112, 'test': 113}
```

## Loading

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
session = ds["val"][0]
T = session["num_frames"]

# 2D arrays are stored flat: reshape to recover the original shape
codes = np.array(session["codes_ch0"]).reshape(T, 8)        # (T, 8) int
feats = np.array(session["mimi_feat_ch0"]).reshape(T, 512)  # (T, 512) float
fvad = np.array(session["fvad_ch0"]).reshape(T, 4)          # (T, 4) float

# 1D arrays: use directly
vad = np.array(session["vad_ch0"])  # (T,) float
eot = np.array(session["eot_ch0"])  # (T,) int
```

## PyTorch windowed loader

```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset

LABEL_KEYS = ["eot", "hold", "bot", "bc"]
CHANNELS = ["ch0", "ch1"]

def collate_windows(sessions, window_frames=125, hop_frames=25):
    """Slice each session into fixed-length windows and collate into a batch."""
    windows = []
    for s in sessions:
        T = s["num_frames"]
        # Convert each column once per session rather than once per window.
        arrays = {}
        for ch in CHANNELS:
            arrays[f"codes_{ch}"] = (np.array(s[f"codes_{ch}"]).reshape(T, 8), torch.long)
            arrays[f"vad_{ch}"] = (np.array(s[f"vad_{ch}"]), torch.float)
            for name in LABEL_KEYS:
                arrays[f"{name}_{ch}"] = (np.array(s[f"{name}_{ch}"]), torch.float)
        # Sessions shorter than window_frames contribute no windows.
        for start in range(0, T - window_frames + 1, hop_frames):
            end = start + window_frames
            windows.append({k: torch.tensor(a[start:end], dtype=dt) for k, (a, dt) in arrays.items()})
    if not windows:
        raise ValueError("no session in this batch has at least window_frames frames")
    return {k: torch.stack([w[k] for w in windows]) for k in windows[0]}

ds = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
# At 12.5 Hz, 125 frames is a 10 s window and 25 frames is a 2 s hop.
loader = DataLoader(ds["train"], batch_size=8, shuffle=True,
                    collate_fn=lambda b: collate_windows(b, window_frames=125, hop_frames=25))

batch = next(iter(loader))
print(batch["codes_ch0"].shape)  # [N_windows, 125, 8]
print(batch["eot_ch0"].shape)    # [N_windows, 125]
```

## Label definitions

| Label | Meaning |
|-------|---------|
| **EOT** | End-of-Turn: speaker yields the floor |
| **HOLD** | Speaker keeps the floor (no handover) |
| **BOT** | Beginning-of-Turn: other speaker takes the floor |
| **BC** | Backchannel: short acknowledgement, no floor claim |
| **VAD** | Voice Activity Detection (1 = speech) |

## DualTurn Model & Code

The following will be released soon:

- **Trained model checkpoint**: on HuggingFace at [anyreach-ai](https://huggingface.co/anyreach-ai)
- **Training code**: model architecture, training loop, and configs
- **Evaluation code**: benchmarks and metrics used in the paper

## Authors

- [Shangeth Rajaa](https://github.com/shangeth), Senior ML Research Scientist, Anyreach AI

## Citation

This dataset was used for training and evaluation in the **DualTurn** paper (submitted to Interspeech 2026). `splits.json` contains the exact train/val/test splits from the official dataset used for all experiments in the paper.

**Paper:** [DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining](https://arxiv.org/abs/2603.08216)

```bibtex
@misc{rajaa2026dualturnlearningturntakingdualchannel,
  title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
  author={Shangeth Rajaa},
  year={2026},
  eprint={2603.08216},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2603.08216},
}
```

If you use this dataset, please cite [otoearth/otoSpeech-full-duplex-280h](https://huggingface.co/datasets/otoearth/otoSpeech-full-duplex-280h):

```bibtex
@misc{otoSpeech-full-duplex-280h,
  title = {otoSpeech-full-duplex-280h: Full-Duplex Conversational Speech Dataset},
  author = {otoearth},
  year = {2025},
  howpublished = {\url{https://huggingface.co/datasets/otoearth/otoSpeech-full-duplex-280h}},
  note = {License: CC BY 4.0}
}
```
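The frame rate given in the Features section (12.5 Hz, 80 ms per frame) implies a direct mapping from the sparse event labels to timestamps. The following is a minimal sketch of that conversion on synthetic data. The helper name `event_times` is hypothetical, and the card does not say whether an event frame marks the start, center, or end of its 80 ms span; this sketch reports the frame start time as an assumption.

```python
import numpy as np

FRAME_RATE_HZ = 12.5              # frame rate stated in the dataset card
FRAME_SEC = 1.0 / FRAME_RATE_HZ   # 0.08 s per frame

def event_times(labels, frame_sec=FRAME_SEC):
    """Return the start time (seconds) of every frame whose label is nonzero.

    Hypothetical helper: assumes an event is attributed to the start of its frame.
    """
    frames = np.flatnonzero(np.asarray(labels))
    return frames * frame_sec

# Synthetic stand-in for a 1D label column such as session["eot_ch0"]
eot = np.zeros(50, dtype=int)
eot[[10, 37]] = 1   # two events, at frames 10 and 37

times = event_times(eot)
print(times)  # start times of frames 10 and 37, in seconds
```

With a real session, the same call would take `np.array(session["eot_ch0"])` in place of the synthetic array.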

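The windowed loader slices each session with `range(0, T - window_frames + 1, hop_frames)`, so the number of windows per session, and the number of trailing frames that fall outside the last window, follow from simple arithmetic. The sketch below works this out for a few frame counts. The helper `num_windows` and the sample values of `T` are illustrative only and are not part of the dataset or its code.

```python
def num_windows(num_frames, window_frames=125, hop_frames=25):
    """Count the windows produced by range(0, T - W + 1, H)."""
    if num_frames < window_frames:
        return 0  # session is shorter than one window
    return (num_frames - window_frames) // hop_frames + 1

def dropped_frames(num_frames, window_frames=125, hop_frames=25):
    """Frames after the end of the last window (all frames if no window fits)."""
    n = num_windows(num_frames, window_frames, hop_frames)
    if n == 0:
        return num_frames
    return num_frames - ((n - 1) * hop_frames + window_frames)

# Illustrative frame counts; at 12.5 Hz, 1000 frames is an 80 s session.
for T in (100, 125, 149, 150, 1000):
    n = num_windows(T)
    # Cross-check the closed form against the loader's range() expression.
    assert n == len(range(0, T - 125 + 1, 25))
    print(f"T={T}: {n} windows, {dropped_frames(T)} trailing frames unused")
```

One consequence worth noting: a session shorter than `window_frames` contributes nothing to a batch, so very short conversations are silently skipped unless they are padded beforehand.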
