hiraki/candor-turntaking-annotations
Hugging Face | Updated 2026-03-25 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/hiraki/candor-turntaking-annotations
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
tags:
- speech
- transcription
- conversation
- turn-taking
- candor
- canary
size_categories:
- 100K<n<1M
dataset_info:
  features:
  - name: audio_filename
    dtype: string
  - name: offset
    dtype: float64
  - name: duration
    dtype: float64
  - name: segment_id
    dtype: string
  - name: conversation_id
    dtype: string
  - name: channel
    dtype: string
  - name: text
    dtype: string
  - name: model
    dtype: string
  - name: alignment_mean_prob
    dtype: float64
  - name: tt_label
    dtype: string
  - name: tt_confidence
    dtype: float64
  - name: llm_label
    dtype: string
  - name: llm_confidence
    dtype: float64
  - name: llm_rationale
    dtype: string
  - name: final_tag
    dtype: string
  - name: accepted
    dtype: bool
  - name: accept_reason
    dtype: string
  - name: reject_reason
    dtype: string
  splits:
  - name: train
    num_examples: 172591
  download_size: 12900000
  dataset_size: 172591
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# CANDOR - Turn-Taking Annotations
A speech transcription and turn-taking annotation dataset built from the [CANDOR corpus](https://github.com/CANDORcorpus/candor-corpus), transcribed with the **NVIDIA Canary-Qwen-2.5B** ASR model.
## Dataset Description
This dataset contains **172,591 transcribed speech segments** from the CANDOR conversational speech corpus (1,656 conversations). Each segment is a single-speaker utterance paired with its Canary ASR transcript. The dataset is designed for turn-taking prediction research.
### Source
- **Audio corpus**: [CANDOR](https://github.com/CANDORcorpus/candor-corpus) (English conversational speech, 1,656 conversations)
- **ASR model**: NVIDIA Canary-Qwen-2.5B (`canary-qwen-2.5b`)
- **Audio format**: Per-speaker mono WAV (16 kHz), extracted from stereo MP3
## Dataset Structure
| Column | Type | Description |
|--------|------|-------------|
| `audio_filename` | string | Per-speaker WAV filename (e.g., `{uuid}_L.wav`) |
| `offset` | float | Start time within the audio file (seconds) |
| `duration` | float | Duration of the segment (seconds) |
| `segment_id` | string | Unique segment identifier |
| `conversation_id` | string | CANDOR conversation UUID |
| `channel` | string | Speaker channel (`L` or `R`) |
| `text` | string | Canary ASR transcript |
| `model` | string | ASR model (`canary-qwen-2.5b`) |
| `alignment_mean_prob` | float | Parakeet CTC forced alignment score (to be added) |
| `tt_label` | string | TEN turn-taking label: finished/unfinished/wait (to be added) |
| `tt_confidence` | float | TEN confidence (3-way softmax) (to be added) |
| `llm_label` | string | LLM label: COMPLETE/INCOMPLETE/BACKCHANNEL (to be added) |
| `llm_confidence` | float | LLM confidence (to be added) |
| `llm_rationale` | string | LLM reasoning (to be added) |
| `final_tag` | string | Cross-annotation consensus label (to be added) |
| `accepted` | bool | Whether segment passed all quality gates (to be added) |
| `accept_reason` | string | Reason for acceptance (to be added) |
| `reject_reason` | string | Reason for rejection (to be added) |
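
To cut a segment out of its per-speaker WAV file, the `offset` and `duration` columns (in seconds) can be converted to sample indices. The following is a minimal sketch that assumes the 16 kHz sample rate stated under Source; the function name `segment_sample_range` is hypothetical and not part of the dataset or any library.

```python
from typing import Tuple

SAMPLE_RATE_HZ = 16_000  # per the card: per-speaker mono WAV at 16 kHz


def segment_sample_range(
    offset_s: float,
    duration_s: float,
    sample_rate: int = SAMPLE_RATE_HZ,
) -> Tuple[int, int]:
    """Return (start, end) sample indices for a segment; end is exclusive."""
    if offset_s < 0 or duration_s < 0:
        raise ValueError("offset and duration must be non-negative")
    start = round(offset_s * sample_rate)
    end = round((offset_s + duration_s) * sample_rate)
    return start, end


# Example: a segment starting at 1.5 s and lasting 2.0 s.
start, end = segment_sample_range(1.5, 2.0)
```

The resulting indices can be used to slice a waveform array loaded with any audio library, for example `waveform[start:end]`.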
### Annotation Pipeline (in progress)
Annotations are being added incrementally:
1. **Parakeet CTC forced alignment** → `alignment_mean_prob`
2. **TEN turn-taking model** → `tt_label`, `tt_confidence`
3. **LLM annotation (Qwen2.5-32B)** → `llm_label`, `llm_confidence`, `llm_rationale`
4. **Cross-annotation** → `final_tag`, `accepted` (requires `tt_confidence >= 0.9` and `alignment_mean_prob >= 0.9`)
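
The two thresholds named in step 4 can be sketched as a simple row-level check. This is an illustration only: the card does not document how `final_tag` consensus is computed or whether `accepted` involves further gates, and treating a missing score as a failure is an assumption made here because the annotation columns are still being added. The name `passes_quality_gates` is hypothetical.

```python
from typing import Any, Mapping

MIN_TT_CONFIDENCE = 0.9   # threshold stated in the pipeline description
MIN_ALIGNMENT_PROB = 0.9  # threshold stated in the pipeline description


def passes_quality_gates(row: Mapping[str, Any]) -> bool:
    """Check the two documented thresholds for a single segment row.

    Assumption: a row whose scores are not yet populated (None) fails.
    """
    tt_conf = row.get("tt_confidence")
    align = row.get("alignment_mean_prob")
    if tt_conf is None or align is None:
        return False
    return tt_conf >= MIN_TT_CONFIDENCE and align >= MIN_ALIGNMENT_PROB


# Toy rows for illustration; these are not real dataset values.
rows = [
    {"segment_id": "a", "tt_confidence": 0.95, "alignment_mean_prob": 0.92},
    {"segment_id": "b", "tt_confidence": 0.95, "alignment_mean_prob": 0.50},
    {"segment_id": "c", "tt_confidence": None, "alignment_mean_prob": None},
]
kept = [r["segment_id"] for r in rows if passes_quality_gates(r)]
```

Once the `accepted` column is populated, filtering on it directly is likely preferable to re-deriving the decision from the raw scores.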
## Statistics
- **Total segments**: 172,591
- **Conversations**: 1,656
- **Segments with text**: 165,925 (96.1%)
- **Mean duration**: 7.6s
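
Figures of this kind can be recomputed from the rows themselves. The sketch below assumes that "segments with text" means a `text` value that is non-empty after stripping whitespace, which the card does not state explicitly; `summarize` is a hypothetical helper.

```python
from typing import Any, Dict, Mapping, Sequence


def summarize(rows: Sequence[Mapping[str, Any]]) -> Dict[str, float]:
    """Compute segment count, share of segments with text, and mean duration."""
    total = len(rows)
    # Assumption: None and whitespace-only transcripts count as "no text".
    with_text = sum(1 for r in rows if (r.get("text") or "").strip())
    mean_duration = sum(r["duration"] for r in rows) / total if total else 0.0
    pct_with_text = 100.0 * with_text / total if total else 0.0
    return {
        "total": total,
        "with_text": with_text,
        "pct_with_text": round(pct_with_text, 1),
        "mean_duration": round(mean_duration, 1),
    }


# Toy rows for illustration; these are not real dataset values.
toy_rows = [
    {"text": "hello there", "duration": 2.0},
    {"text": "", "duration": 4.0},
    {"text": None, "duration": 6.0},
    {"text": "ok", "duration": 8.0},
]
stats = summarize(toy_rows)

# Arithmetic check of the percentage reported above.
reported_pct = round(100.0 * 165_925 / 172_591, 1)
```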
## Intended Use
This dataset is intended for research on:
- Turn-taking prediction and modeling
- Conversational speech recognition
- Backchannel detection
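
For turn-taking work, the per-speaker segments typically need to be interleaved into one timeline per conversation. The sketch below groups rows by `conversation_id` and sorts them by `offset`. It assumes that offsets in the `L` and `R` files of a conversation share a common time origin, which is plausible given that both were extracted from the same stereo recording but is not stated in the card. The helper names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Sequence, Tuple

Row = Mapping[str, Any]


def build_timelines(rows: Sequence[Row]) -> Dict[str, List[Row]]:
    """Group segments by conversation and order them by start time."""
    timelines: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        timelines[row["conversation_id"]].append(row)
    for segments in timelines.values():
        # Ties on offset are broken by channel so the order is deterministic.
        segments.sort(key=lambda r: (r["offset"], r["channel"]))
    return dict(timelines)


def speaker_changes(timeline: Sequence[Row]) -> List[Tuple[Row, Row]]:
    """Return consecutive segment pairs where the speaker channel switches."""
    return [
        (prev, nxt)
        for prev, nxt in zip(timeline, timeline[1:])
        if prev["channel"] != nxt["channel"]
    ]


# Toy rows for illustration; these are not real dataset values.
toy_rows = [
    {"conversation_id": "c1", "channel": "L", "offset": 2.5, "text": "right"},
    {"conversation_id": "c1", "channel": "R", "offset": 1.0, "text": "mm-hm"},
    {"conversation_id": "c1", "channel": "L", "offset": 0.0, "text": "so anyway"},
    {"conversation_id": "c2", "channel": "R", "offset": 0.0, "text": "hi"},
]
timelines = build_timelines(toy_rows)
changes = speaker_changes(timelines["c1"])
```

Sorting by start time alone ignores overlapping speech; a model that handles overlap would also need each segment's `duration`.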
## Related Datasets
- [hiraki/seamless-interact-canary-transcripts](https://huggingface.co/datasets/hiraki/seamless-interact-canary-transcripts): the same pipeline applied to the Seamless Interact corpus (2.78M segments)
## License
CC-BY-4.0 (following the CANDOR corpus license).
Provider:
hiraki



