
hiraki/seamless-interact-canary-transcripts

Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/hiraki/seamless-interact-canary-transcripts
Resource description:
---
language:
- en
license: cc-by-4.0
task_categories:
- automatic-speech-recognition
tags:
- speech
- transcription
- conversation
- turn-taking
- seamless
- canary
size_categories:
- 1M<n<10M
dataset_info:
  features:
  - name: audio_filename
    dtype: large_string
  - name: offset
    dtype: float64
  - name: duration
    dtype: float64
  - name: text
    dtype: large_string
  - name: model
    dtype: large_string
  - name: volume
    dtype: large_string
  - name: session
    dtype: large_string
  - name: interaction
    dtype: large_string
  - name: participant
    dtype: large_string
  - name: segment_id
    dtype: large_string
  - name: split
    dtype: large_string
  - name: silence_after_ms
    dtype: float64
  - name: prev_same_speaker
    dtype: large_string
  - name: next_same_speaker
    dtype: large_string
  - name: prev_other_speaker
    dtype: large_string
  - name: next_other_speaker
    dtype: large_string
  - name: llm_label
    dtype: large_string
  - name: llm_confidence
    dtype: float64
  - name: llm_rationale
    dtype: large_string
  - name: llm_parse_strategy
    dtype: large_string
  - name: tt_label
    dtype: large_string
  - name: tt_confidence
    dtype: float64
  - name: final_tag
    dtype: large_string
  - name: accepted
    dtype: 'null'
  - name: accept_reason
    dtype: large_string
  - name: reject_reason
    dtype: large_string
  splits:
  - name: train
    num_bytes: 1672719890
    num_examples: 2781985
  download_size: 431528233
  dataset_size: 1672719890
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Seamless Interact - Canary Transcripts

Speech transcription dataset generated by running the **NVIDIA Canary-Qwen-2.5B** ASR model on the [Seamless Interact](https://huggingface.co/datasets/facebook/seamless_interact) conversational speech dataset, segmented by the original corpus boundaries.

## Dataset Description

This dataset contains **2,781,985 transcribed speech segments** from the Seamless Interact corpus. Each segment includes the transcribed text, timing information (offset and duration within the source audio), and metadata identifying the source conversation.

### Source

- **Audio corpus**: [Seamless Interact](https://huggingface.co/datasets/facebook/seamless_interact) (English conversational speech)
- **ASR model**: NVIDIA Canary-Qwen-2.5B (`canary-qwen-2.5b`)
- **Segmentation**: Original corpus segmentation boundaries

## Dataset Structure

| Column | Type | Description |
|--------|------|-------------|
| `audio_filename` | string | Source WAV filename (e.g., `V00_S0030_I00000125_P0045.wav`) |
| `offset` | float | Start time of the segment within the audio file (seconds) |
| `duration` | float | Duration of the segment (seconds) |
| `text` | string | Transcribed text |
| `model` | string | ASR model used (`canary-qwen-2.5b`) |
| `volume` | string | Volume ID (e.g., `V00`) |
| `session` | string | Session ID (e.g., `S0030`) |
| `interaction` | string | Interaction ID (e.g., `I00000125`) |
| `participant` | string | Participant ID (e.g., `P0045`) |

### Example

```python
from datasets import load_dataset

ds = load_dataset("hiraki/seamless-interact-canary-transcripts", split="train")
print(ds[0])
# {'audio_filename': 'V00_S0030_I00000125_P0045.wav',
#  'offset': 0.0,
#  'duration': 3.119,
#  'text': 'James what is your happiest childhood memory',
#  'model': 'canary-qwen-2.5b',
#  'volume': 'V00',
#  'session': 'S0030',
#  'interaction': 'I00000125',
#  'participant': 'P0045'}
```

## Statistics

- **Total segments**: 2,781,985
- **Source files**: 93,129 JSONL files
- **Format**: Parquet (94.3 MB)

## Intended Use

This dataset is intended for research on:

- Conversational speech recognition
- Turn-taking prediction and modeling
- Dialogue systems

## License

This dataset follows the license of the source Seamless Interact corpus (CC-BY-4.0).
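
## Usage Sketch

The identifier columns in the Dataset Structure table appear to mirror the parts of `audio_filename`, and `offset` plus `duration` give the segment's end time. The snippet below is a minimal sketch of both relationships. It assumes every filename follows the `<volume>_<session>_<interaction>_<participant>.wav` pattern, which is inferred from the single example in this card and is not otherwise documented. The helper names `parse_audio_filename` and `segment_bounds` are hypothetical and not part of the dataset or of any library.

```python
from pathlib import PurePosixPath
from typing import Dict, Tuple


def parse_audio_filename(audio_filename: str) -> Dict[str, str]:
    """Split a filename such as 'V00_S0030_I00000125_P0045.wav' into ID parts.

    Assumption: the name is four underscore-separated fields in the order
    volume, session, interaction, participant (inferred from the card example).
    """
    stem = PurePosixPath(audio_filename).stem  # drop the '.wav' extension
    parts = stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected filename layout: {audio_filename!r}")
    volume, session, interaction, participant = parts
    return {
        "volume": volume,
        "session": session,
        "interaction": interaction,
        "participant": participant,
    }


def segment_bounds(row: Dict[str, float]) -> Tuple[float, float]:
    """Return (start, end) in seconds for one segment row."""
    start = row["offset"]
    return start, start + row["duration"]


# Row copied from the example output in this card.
row = {
    "audio_filename": "V00_S0030_I00000125_P0045.wav",
    "offset": 0.0,
    "duration": 3.119,
    "volume": "V00",
    "session": "S0030",
    "interaction": "I00000125",
    "participant": "P0045",
}

ids = parse_audio_filename(row["audio_filename"])
# Check that the parsed IDs agree with the dedicated columns.
assert all(ids[key] == row[key] for key in ids)

start, end = segment_bounds(row)
print(ids, start, end)
```

A consistency check like the `assert` above can be run over a sample of rows before relying on the filename pattern, since the card does not state that it holds for the whole corpus.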
Provider:
hiraki