hiraki/seamless-interact-canary-transcripts
Source: Hugging Face. Updated 2026-03-24. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/hiraki/seamless-interact-canary-transcripts
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
tags:
- speech
- transcription
- conversation
- turn-taking
- seamless
- canary
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: audio_filename
    dtype: large_string
  - name: offset
    dtype: float64
  - name: duration
    dtype: float64
  - name: text
    dtype: large_string
  - name: model
    dtype: large_string
  - name: volume
    dtype: large_string
  - name: session
    dtype: large_string
  - name: interaction
    dtype: large_string
  - name: participant
    dtype: large_string
  - name: segment_id
    dtype: large_string
  - name: split
    dtype: large_string
  - name: silence_after_ms
    dtype: float64
  - name: prev_same_speaker
    dtype: large_string
  - name: next_same_speaker
    dtype: large_string
  - name: prev_other_speaker
    dtype: large_string
  - name: next_other_speaker
    dtype: large_string
  - name: llm_label
    dtype: large_string
  - name: llm_confidence
    dtype: float64
  - name: llm_rationale
    dtype: large_string
  - name: llm_parse_strategy
    dtype: large_string
  - name: tt_label
    dtype: large_string
  - name: tt_confidence
    dtype: float64
  - name: final_tag
    dtype: large_string
  - name: accepted
    dtype: 'null'
  - name: accept_reason
    dtype: large_string
  - name: reject_reason
    dtype: large_string
  splits:
  - name: train
    num_bytes: 1672719890
    num_examples: 2781985
  download_size: 431528233
  dataset_size: 1672719890
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
# Seamless Interact - Canary Transcripts
Speech transcription dataset generated by running the **NVIDIA Canary-Qwen-2.5B** ASR model on the [Seamless Interact](https://huggingface.co/datasets/facebook/seamless_interact) conversational speech dataset. Segments follow the original corpus segmentation boundaries.
## Dataset Description
This dataset contains **2,781,985 transcribed speech segments** from the Seamless Interact corpus. Each segment includes the transcribed text, timing information (offset and duration within the source audio), and metadata identifying the source conversation.
### Source
- **Audio corpus**: [Seamless Interact](https://huggingface.co/datasets/facebook/seamless_interact) (English conversational speech)
- **ASR model**: NVIDIA Canary-Qwen-2.5B (`canary-qwen-2.5b`)
- **Segmentation**: Original corpus segmentation boundaries
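## Dataset Structure
The table below describes the nine core columns. The dataset metadata lists further columns (for example `segment_id`, `split`, `silence_after_ms`, and the `llm_*` and `tt_*` annotation fields) that this card does not describe.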
| Column | Type | Description |
|--------|------|-------------|
| `audio_filename` | string | Source WAV filename (e.g., `V00_S0030_I00000125_P0045.wav`) |
| `offset` | float | Start time of the segment within the audio file (seconds) |
| `duration` | float | Duration of the segment (seconds) |
| `text` | string | Transcribed text |
| `model` | string | ASR model used (`canary-qwen-2.5b`) |
| `volume` | string | Volume ID (e.g., `V00`) |
| `session` | string | Session ID (e.g., `S0030`) |
| `interaction` | string | Interaction ID (e.g., `I00000125`) |
| `participant` | string | Participant ID (e.g., `P0045`) |
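The ID columns appear to mirror the parts of `audio_filename`. The following is a minimal sketch of recovering them from a filename, assuming every filename follows the `V.._S.._I.._P...wav` pattern seen in the example above; the pattern and the helper name `parse_audio_filename` are illustrative assumptions, not a documented naming specification.

```python
import re
from typing import Dict

# Pattern inferred from the single example filename in this card (an assumption,
# not a documented specification of the corpus naming scheme).
FILENAME_RE = re.compile(
    r"^(?P<volume>V\d+)_(?P<session>S\d+)_(?P<interaction>I\d+)_(?P<participant>P\d+)\.wav$"
)


def parse_audio_filename(audio_filename: str) -> Dict[str, str]:
    """Split a filename into volume, session, interaction and participant IDs."""
    match = FILENAME_RE.match(audio_filename)
    if match is None:
        # Fail loudly instead of guessing when a name does not fit the pattern.
        raise ValueError(f"Unexpected filename format: {audio_filename!r}")
    return match.groupdict()


ids = parse_audio_filename("V00_S0030_I00000125_P0045.wav")
print(ids)
```

A check like this can be used to confirm that the `volume`, `session`, `interaction`, and `participant` columns agree with the filename for a given row.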
### Example
```python
from datasets import load_dataset
ds = load_dataset("hiraki/seamless-interact-canary-transcripts", split="train")
print(ds[0])
# {'audio_filename': 'V00_S0030_I00000125_P0045.wav',
# 'offset': 0.0,
# 'duration': 3.119,
# 'text': 'James what is your happiest childhood memory',
# 'model': 'canary-qwen-2.5b',
# 'volume': 'V00',
# 'session': 'S0030',
# 'interaction': 'I00000125',
# 'participant': 'P0045'}
```
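Because each row carries `offset` and `duration` in seconds, the end time of a segment is their sum. The sketch below works on plain dictionaries shaped like the row printed above, so it runs without downloading the dataset. The second row and the helper names `segment_end` and `filter_by_duration` are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, List


def segment_end(row: Dict) -> float:
    """End time of a segment within its audio file, in seconds."""
    return row["offset"] + row["duration"]


def filter_by_duration(
    rows: Iterable[Dict], min_seconds: float, max_seconds: float
) -> List[Dict]:
    """Keep rows whose duration lies in [min_seconds, max_seconds]."""
    return [r for r in rows if min_seconds <= r["duration"] <= max_seconds]


rows = [
    # First row copied from the example output in this card.
    {"offset": 0.0, "duration": 3.119,
     "text": "James what is your happiest childhood memory"},
    # Hypothetical row, invented for illustration only.
    {"offset": 4.0, "duration": 0.25, "text": "yeah"},
]

print(segment_end(rows[0]))
print(len(filter_by_duration(rows, 1.0, 10.0)))
```

With the `datasets` library, the same predicate could be passed to `ds.filter` instead of a list comprehension.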
## Statistics
- **Total segments**: 2,781,985
- **Source files**: 93,129 JSONL files
- **Format**: Parquet (94.3 MB as stated here; the dataset metadata lists a download size of 431,528,233 bytes)
## Intended Use
This dataset is intended for research on:
- Conversational speech recognition
- Turn-taking prediction and modeling
- Dialogue systems
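For turn-taking work, one possible starting point is to group segments by `interaction`, order them by `offset`, and measure the gap at each speaker change. This is only a sketch: it assumes that the offsets of both participants in an interaction share a common timeline, which this card does not confirm, and all sample rows and helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_interaction(rows: List[Dict]) -> Dict[str, List[Dict]]:
    """Group rows by interaction ID and sort each group by start time."""
    grouped: Dict[str, List[Dict]] = defaultdict(list)
    for row in rows:
        grouped[row["interaction"]].append(row)
    for segments in grouped.values():
        segments.sort(key=lambda r: r["offset"])
    return dict(grouped)


def speaker_change_gaps(segments: List[Dict]) -> List[float]:
    """Gap in seconds at each speaker change; negative values mean overlap."""
    gaps = []
    for prev, curr in zip(segments, segments[1:]):
        if prev["participant"] != curr["participant"]:
            gaps.append(curr["offset"] - (prev["offset"] + prev["duration"]))
    return gaps


# Hypothetical rows, invented for illustration; not taken from the dataset.
sample = [
    {"interaction": "I00000125", "participant": "P0046", "offset": 3.5, "duration": 2.0},
    {"interaction": "I00000125", "participant": "P0045", "offset": 0.0, "duration": 3.0},
    {"interaction": "I00000125", "participant": "P0045", "offset": 5.25, "duration": 1.0},
]

timeline = group_by_interaction(sample)["I00000125"]
print(speaker_change_gaps(timeline))
```

A positive gap is silence between speakers and a negative gap is overlapping speech, which is why the end of the previous segment is subtracted from the next start rather than comparing offsets alone.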
## License
This dataset follows the license of the source Seamless Interact corpus (CC-BY-4.0).
Provider:
hiraki



