hiraki/seamless-interact-canary-transcripts

Name: hiraki/seamless-interact-canary-transcripts
Creator: hiraki
Published: 2026-03-24 22:17:50
License: CC-BY-4.0

Source: Hugging Face. Updated 2026-03-24. Indexed 2026-03-29.

Download link:

https://hf-mirror.com/datasets/hiraki/seamless-interact-canary-transcripts

Resource description:

---
language:
- en
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
tags:
- speech
- transcription
- conversation
- turn-taking
- seamless
- canary
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: audio_filename
    dtype: large_string
  - name: offset
    dtype: float64
  - name: duration
    dtype: float64
  - name: text
    dtype: large_string
  - name: model
    dtype: large_string
  - name: volume
    dtype: large_string
  - name: session
    dtype: large_string
  - name: interaction
    dtype: large_string
  - name: participant
    dtype: large_string
  - name: segment_id
    dtype: large_string
  - name: split
    dtype: large_string
  - name: silence_after_ms
    dtype: float64
  - name: prev_same_speaker
    dtype: large_string
  - name: next_same_speaker
    dtype: large_string
  - name: prev_other_speaker
    dtype: large_string
  - name: next_other_speaker
    dtype: large_string
  - name: llm_label
    dtype: large_string
  - name: llm_confidence
    dtype: float64
  - name: llm_rationale
    dtype: large_string
  - name: llm_parse_strategy
    dtype: large_string
  - name: tt_label
    dtype: large_string
  - name: tt_confidence
    dtype: float64
  - name: final_tag
    dtype: large_string
  - name: accepted
    dtype: 'null'
  - name: accept_reason
    dtype: large_string
  - name: reject_reason
    dtype: large_string
  splits:
  - name: train
    num_bytes: 1672719890
    num_examples: 2781985
  download_size: 431528233
  dataset_size: 1672719890
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Seamless Interact - Canary Transcripts

Speech transcription dataset generated by running the **NVIDIA Canary-Qwen-2.5B** ASR model on the [Seamless Interact](https://huggingface.co/datasets/facebook/seamless_interact) conversational speech dataset, segmented by the original corpus boundaries.

## Dataset Description

This dataset contains **2,781,985 transcribed speech segments** from the Seamless Interact corpus. Each segment includes the transcribed text, timing information (offset and duration within the source audio), and metadata identifying the source conversation.
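Because each segment carries an `offset` and a `duration` in seconds, the corresponding slice of the source audio can be located by converting both to sample indices. The following is a minimal sketch of that conversion; the helper name `segment_sample_range` and the 16 kHz default sample rate are assumptions for illustration, not values stated by the dataset card, so check the actual sample rate of the source WAV files before relying on it.

```python
def segment_sample_range(
    offset_s: float, duration_s: float, sample_rate: int = 16_000
) -> tuple[int, int]:
    """Convert a segment's offset and duration (seconds) to sample indices.

    `sample_rate` defaults to 16 kHz as an assumption; the dataset card
    does not state the sample rate of the source audio.
    """
    start = round(offset_s * sample_rate)
    end = round((offset_s + duration_s) * sample_rate)
    return start, end


# Timing values taken from the example row shown in the dataset card.
start, end = segment_sample_range(offset_s=0.0, duration_s=3.119)
print(start, end)
```

With a waveform loaded as a one-dimensional array, `waveform[start:end]` would then select the samples for that segment.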
### Source

- **Audio corpus**: [Seamless Interact](https://huggingface.co/datasets/facebook/seamless_interact) (English conversational speech)
- **ASR model**: NVIDIA Canary-Qwen-2.5B (`canary-qwen-2.5b`)
- **Segmentation**: Original corpus segmentation boundaries

## Dataset Structure

| Column | Type | Description |
|--------|------|-------------|
| `audio_filename` | string | Source WAV filename (e.g., `V00_S0030_I00000125_P0045.wav`) |
| `offset` | float | Start time of the segment within the audio file (seconds) |
| `duration` | float | Duration of the segment (seconds) |
| `text` | string | Transcribed text |
| `model` | string | ASR model used (`canary-qwen-2.5b`) |
| `volume` | string | Volume ID (e.g., `V00`) |
| `session` | string | Session ID (e.g., `S0030`) |
| `interaction` | string | Interaction ID (e.g., `I00000125`) |
| `participant` | string | Participant ID (e.g., `P0045`) |

### Example

```python
from datasets import load_dataset

ds = load_dataset("hiraki/seamless-interact-canary-transcripts", split="train")
print(ds[0])
# {'audio_filename': 'V00_S0030_I00000125_P0045.wav',
#  'offset': 0.0,
#  'duration': 3.119,
#  'text': 'James what is your happiest childhood memory',
#  'model': 'canary-qwen-2.5b',
#  'volume': 'V00',
#  'session': 'S0030',
#  'interaction': 'I00000125',
#  'participant': 'P0045'}
```

## Statistics

- **Total segments**: 2,781,985
- **Source files**: 93,129 JSONL files
- **Format**: Parquet (94.3 MB)

## Intended Use

This dataset is intended for research on:

- Conversational speech recognition
- Turn-taking prediction and modeling
- Dialogue systems

## License

This dataset follows the license of the source Seamless Interact corpus (CC-BY-4.0).
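The example filename in the Dataset Structure table (`V00_S0030_I00000125_P0045.wav`) appears to concatenate the volume, session, interaction, and participant IDs that are also stored as separate columns. The sketch below splits a filename into those parts. The regular expression is inferred from that single example, and the names `SegmentSource` and `parse_audio_filename` are hypothetical, so treat this as an assumption about the naming scheme rather than a documented format.

```python
import re
from typing import NamedTuple


class SegmentSource(NamedTuple):
    volume: str
    session: str
    interaction: str
    participant: str


# Pattern inferred from the one example filename in the dataset card.
_FILENAME_RE = re.compile(r"^(V\d+)_(S\d+)_(I\d+)_(P\d+)\.wav$")


def parse_audio_filename(name: str) -> SegmentSource:
    """Split an audio filename into its volume/session/interaction/participant IDs."""
    match = _FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected audio filename: {name!r}")
    return SegmentSource(*match.groups())


src = parse_audio_filename("V00_S0030_I00000125_P0045.wav")
print(src.volume, src.session, src.interaction, src.participant)
```

A helper like this could serve as a consistency check between `audio_filename` and the `volume`, `session`, `interaction`, and `participant` columns of a row.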

Provided by:

hiraki
