hiraki/candor-turntaking-annotations

Name: hiraki/candor-turntaking-annotations
Creator: hiraki
Published: 2026-03-25 03:41:21
License: CC-BY-4.0

Hugging Face · Updated 2026-03-25 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/hiraki/candor-turntaking-annotations

Description:

---
language:
- en
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
tags:
- speech
- transcription
- conversation
- turn-taking
- candor
- canary
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: audio_filename
    dtype: string
  - name: offset
    dtype: float64
  - name: duration
    dtype: float64
  - name: segment_id
    dtype: string
  - name: conversation_id
    dtype: string
  - name: channel
    dtype: string
  - name: text
    dtype: string
  - name: model
    dtype: string
  - name: alignment_mean_prob
    dtype: float64
  - name: tt_label
    dtype: string
  - name: tt_confidence
    dtype: float64
  - name: llm_label
    dtype: string
  - name: llm_confidence
    dtype: float64
  - name: llm_rationale
    dtype: string
  - name: final_tag
    dtype: string
  - name: accepted
    dtype: bool
  - name: accept_reason
    dtype: string
  - name: reject_reason
    dtype: string
  splits:
  - name: train
    num_examples: 172591
  download_size: 12900000
  dataset_size: 172591
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# CANDOR - Turn-Taking Annotations

Speech transcription and turn-taking annotation dataset built from the [CANDOR corpus](https://github.com/CANDORcorpus/candor-corpus) using **NVIDIA Canary-Qwen2.5B** ASR.

## Dataset Description

This dataset contains **172,591 transcribed speech segments** from the CANDOR conversational speech corpus (1,656 conversations). Each segment is a per-speaker utterance with a Canary ASR transcript, designed for turn-taking prediction research.

### Source

- **Audio corpus**: [CANDOR](https://github.com/CANDORcorpus/candor-corpus) (English conversational speech, 1,656 conversations)
- **ASR model**: NVIDIA Canary-Qwen2.5B (`canary-qwen-2.5b`)
- **Audio format**: Per-speaker mono WAV (16kHz), extracted from stereo MP3

## Dataset Structure

| Column | Type | Description |
|--------|------|-------------|
| `audio_filename` | string | Per-speaker WAV filename (e.g., `{uuid}_L.wav`) |
| `offset` | float | Start time within the audio file (seconds) |
| `duration` | float | Duration of the segment (seconds) |
| `segment_id` | string | Unique segment identifier |
| `conversation_id` | string | CANDOR conversation UUID |
| `channel` | string | Speaker channel (`L` or `R`) |
| `text` | string | Canary ASR transcript |
| `model` | string | ASR model (`canary-qwen-2.5b`) |
| `alignment_mean_prob` | float | Parakeet CTC forced alignment score (to be added) |
| `tt_label` | string | TEN turn-taking label: finished/unfinished/wait (to be added) |
| `tt_confidence` | float | TEN confidence (3-way softmax) (to be added) |
| `llm_label` | string | LLM label: COMPLETE/INCOMPLETE/BACKCHANNEL (to be added) |
| `llm_confidence` | float | LLM confidence (to be added) |
| `llm_rationale` | string | LLM reasoning (to be added) |
| `final_tag` | string | Cross-annotation consensus label (to be added) |
| `accepted` | bool | Whether segment passed all quality gates (to be added) |
| `accept_reason` | string | Reason for acceptance (to be added) |
| `reject_reason` | string | Reason for rejection (to be added) |

### Annotation Pipeline (in progress)

Annotations are being added incrementally:

1. **Parakeet CTC forced alignment** → `alignment_mean_prob`
2. **TEN turn-taking model** → `tt_label`, `tt_confidence`
3. **LLM annotation (Qwen2.5-32B)** → `llm_label`, `llm_confidence`, `llm_rationale`
4. **Cross-annotation** → `final_tag`, `accepted` (requires tt_confidence >= 0.9, alignment_mean_prob >= 0.9)

## Statistics

- **Total segments**: 172,591
- **Conversations**: 1,656
- **Segments with text**: 165,925 (96.1%)
- **Mean duration**: 7.6s

## Intended Use

This dataset is intended for research on:

- Turn-taking prediction and modeling
- Conversational speech recognition
- Backchannel detection

## Related Datasets

- [hiraki/seamless-interact-canary-transcripts](https://huggingface.co/datasets/hiraki/seamless-interact-canary-transcripts): Same pipeline applied to Seamless Interact corpus (2.78M segments)

## License

CC-BY-4.0 (following the CANDOR corpus license).
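To make the cross-annotation thresholds and the `offset`/`duration` columns concrete, here is a minimal Python sketch. It is not the dataset author's pipeline. The helper names `passes_stated_thresholds` and `sample_range` are hypothetical, rows are assumed to be plain dictionaries keyed by the column names in the card, and treating a missing annotation (a column still marked "to be added") as a failed gate is an assumption. The card may apply further quality gates that it does not spell out.

```python
from typing import Any, Mapping, Tuple

# Thresholds quoted in the card's cross-annotation step.
TT_CONFIDENCE_MIN = 0.9
ALIGNMENT_MEAN_PROB_MIN = 0.9


def passes_stated_thresholds(row: Mapping[str, Any]) -> bool:
    """Check the two thresholds the card names for the `accepted` flag.

    Hypothetical helper: a row whose annotation columns are still empty
    (None) is treated as not passing, which is an assumption.
    """
    tt_conf = row.get("tt_confidence")
    align = row.get("alignment_mean_prob")
    if tt_conf is None or align is None:
        return False
    return tt_conf >= TT_CONFIDENCE_MIN and align >= ALIGNMENT_MEAN_PROB_MIN


def sample_range(row: Mapping[str, Any], sample_rate: int = 16_000) -> Tuple[int, int]:
    """Convert `offset` and `duration` (seconds) into sample indices.

    Assumes the 16 kHz per-speaker mono WAV format described in the card.
    """
    start = round(row["offset"] * sample_rate)
    end = round((row["offset"] + row["duration"]) * sample_rate)
    return start, end


if __name__ == "__main__":
    # Illustrative row with made-up values, not taken from the dataset.
    example = {
        "offset": 1.5,
        "duration": 2.0,
        "tt_confidence": 0.95,
        "alignment_mean_prob": 0.92,
    }
    print(passes_stated_thresholds(example))
    print(sample_range(example))
```

The sample indices returned by `sample_range` can be used to slice a waveform array loaded from the file named in `audio_filename`, for instance `waveform[start:end]`.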

Provider:

hiraki
