
Netdrum/marsar_vhf

Source: Hugging Face. Updated 2026-04-09; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/Netdrum/marsar_vhf
Resource description:

---
pretty_name: MARSAR VHF
task_categories:
- automatic-speech-recognition
language:
- en
license: cc-by-4.0
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: train.csv
  - split: test
    path: test.csv
dataset_info:
  features:
  - name: audio
    dtype: string
  - name: text
    dtype: string
  - name: audio_duration_seconds
    dtype: float64
  - name: audio_duration_error
    dtype: string
tags:
- Maritime,
- SAR,
- VHF,
---

# Netdrum/marsar_vhf

Whisper fine-tuning dataset prepared from VHF transcription CSV exports.

---

## Research Context

This dataset was developed as part of a Professional Doctorate in Engineering (PDEng) research project conducted in collaboration between the Irish Coast Guard (IRCG) and the University of Limerick. The research investigates the application of domain-adapted Automatic Speech Recognition (ASR) systems to maritime Very High Frequency (VHF) radio communications used in Search and Rescue (SAR) operations.

Within Maritime Rescue Coordination Centres (MRCCs), watch officers must process high volumes of real-time voice communications under time-critical and cognitively demanding conditions. This dataset supports the development and evaluation of AI-driven systems designed to enhance situational awareness, reduce operator workload, and improve the accuracy and timeliness of incident logging. It has been curated to reflect key characteristics of maritime VHF communications, including short-duration transmissions, non-standard phraseology, domain-specific terminology (e.g., vessel names and geographic locations), and variable audio quality representative of operational environments.

---

## Dataset Creation

This dataset comprises simulated maritime VHF radio communications developed with the assistance of volunteer participants. Contributors recorded a range of VHF-style transmissions designed to reflect realistic communication patterns encountered during Search and Rescue (SAR) operations.

The dataset includes both clean audio samples and augmented noisy recordings. Noise profiles such as background interference and radio static were introduced to replicate the acoustic conditions typical of VHF radio channels in operational environments. The primary objective of this dataset is to approximate real-world radio communication scenarios experienced by Irish Coast Guard Maritime Rescue Coordination Centres (MRCCs), while ensuring no real operational communications are disclosed.

---

## Research Objectives

The primary objectives of this dataset are:

- To support the fine-tuning and evaluation of transformer-based ASR models (e.g., Whisper) for maritime VHF communication.
- To provide a domain-specific benchmark for speech recognition performance under noisy, real-world communication conditions.
- To enable investigation of Word Error Rate (WER) and domain-specific error patterns in maritime SAR communications.
- To facilitate research into downstream tasks such as Named Entity Recognition (NER) for extracting key operational information (e.g., vessel names, positions, distress types).
- To contribute to the development of integrated decision-support systems within MRCC environments by enabling real-time transcription of radio communications.
- To explore the impact of dataset curation strategies (e.g., normalization, filtering, segmentation) on ASR performance in low-resource, high-noise domains.

This dataset forms part of a broader research effort to design and evaluate AI-enabled communication pipelines for maritime emergency response systems.

---

## Splits

- `train.csv`: 1970 rows. Training dataset.
- `test.csv`: 219 rows. Held-out test dataset.

---

## Columns

- `audio`: Publicly accessible URL to audio file.
- `text`: Normalized transcription used as ground truth.
- `audio_duration_seconds`: Duration of audio clip in seconds.
- `audio_duration_error`: Duration lookup error (empty for valid samples).
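
As a minimal sketch of how the columns above might be consumed, the following Python snippet parses a split's CSV text and keeps only rows whose `audio_duration_error` is empty, which the column description gives as the marker of a valid sample. The helper names `load_valid_rows` and `total_hours`, the `example.com` URLs, and the error string in `SAMPLE` are hypothetical and not taken from the dataset. The snippet also assumes each CSV file has a header row with the four documented column names.

```python
import csv
import io


def load_valid_rows(csv_text):
    """Parse split CSV text and keep rows with an empty audio_duration_error."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # A non-empty error field means the duration lookup failed; skip it.
        if (row.get("audio_duration_error") or "").strip():
            continue
        # Convert the duration from its CSV string form to a float.
        row["audio_duration_seconds"] = float(row["audio_duration_seconds"])
        rows.append(row)
    return rows


def total_hours(rows):
    """Sum clip durations (seconds) and convert the total to hours."""
    return sum(r["audio_duration_seconds"] for r in rows) / 3600.0


# Hypothetical two-row sample in the documented column layout (not real data).
SAMPLE = (
    "audio,text,audio_duration_seconds,audio_duration_error\n"
    "https://example.com/a.wav,mayday mayday mayday,3.5,\n"
    "https://example.com/b.wav,,,lookup failed\n"
)

valid_rows = load_valid_rows(SAMPLE)
```

With a locally downloaded copy of a split, the same helper could be called as `load_valid_rows(open("train.csv", encoding="utf-8").read())`, assuming the file sits in the working directory.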

---

## Normalization

The transcription text has been normalized using the following steps:

- Lowercasing applied to all text
- Punctuation removed
- Whitespace standardised

---

## Source Policy

- `clean_dataset_with_valid_urls.hf.csv` and `labelstudio_tasks_normalized.hf.csv` were used to generate the training and test splits.
- `live_vhf_data.hf.csv` is reserved as a validation dataset and is not included in the published dataset to ensure unbiased evaluation.

---

## Duration Summary

- `train.csv`: 17,425.744 seconds (~4.84 hours)
- `test.csv`: 1,850.363 seconds (~0.51 hours)

---

## Notes

- The published dataset contains only the train and test splits.
- A separate validation dataset exists but is intentionally excluded from the public release to support controlled experimental evaluation.
- Only audio files explicitly referenced in the dataset are included; no additional audio samples from storage are used.
- License and reuse terms should be reviewed and updated on the Hugging Face Hub if a more specific license is required.
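
The three normalization steps listed under Normalization (lowercasing, punctuation removal, whitespace standardisation) could be approximated as in the sketch below, for example to bring model output into the same form as the `text` column before scoring. This is not the authors' actual pipeline: the function name `normalize_transcript` is hypothetical, and the snippet assumes "punctuation" means the ASCII set in Python's `string.punctuation`, so hyphenated words are joined rather than split. The dataset's own handling of hyphens, apostrophes, and non-ASCII punctuation is not documented here and may differ.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_transcript(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace runs."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    # Replace any run of whitespace with one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


example = normalize_transcript("Mayday, Mayday!  This is  Sea-Breeze.")
```

Applying the same function to both reference and hypothesis transcripts keeps a WER comparison from being dominated by casing or punctuation differences.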
Provider:
Netdrum