nccratliri/vad-zebra-finch

Name: nccratliri/vad-zebra-finch
Creator: nccratliri
Published: 2023-10-03 07:12:09
License: apache-2.0

Source: Hugging Face. Updated 2023-10-03. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/nccratliri/vad-zebra-finch


Resource description:

---
license: apache-2.0
---

# Positive Transfer Of The Whisper Speech Transformer To Human And Animal Voice Activity Detection

We proposed WhisperSeg, utilizing the Whisper Transformer pre-trained for Automatic Speech Recognition (ASR) for both human and animal Voice Activity Detection (VAD). For more details, please refer to our paper

> [**Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection**](https://doi.org/10.1101/2023.09.30.560270)
>
> Nianlong Gu, Kanghwi Lee, Maris Basha, Sumit Kumar Ram, Guanghao You, Richard H. R. Hahnloser <br>
> University of Zurich and ETH Zurich

This is the Zebra finch dataset customized for Animal Voice Activity Detection (vocal segmentation) in WhisperSeg.

## Download Dataset

```python
from huggingface_hub import snapshot_download
snapshot_download('nccratliri/vad-zebra-finch', local_dir = "data/zebra-finch", repo_type="dataset" )
```

For more usage details, please refer to the GitHub repository: https://github.com/nianlonggu/WhisperSeg

When using this dataset, please also cite:

```
@article {Tomka2023.09.04.555475,
  author = {Tomas Tomka and Xinyu Hao and Aoxue Miao and Kanghwi Lee and Maris Basha and Stefan Reimann and Anja T Zai and Richard Hahnloser},
  title = {Benchmarking nearest neighbor retrieval of zebra finch vocalizations across development},
  elocation-id = {2023.09.04.555475},
  year = {2023},
  doi = {10.1101/2023.09.04.555475},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {Vocalizations are highly specialized motor gestures that regulate social interactions. The reliable detection of vocalizations from raw streams of microphone data remains an open problem even in research on widely studied animals such as the zebra finch. A promising method for finding vocal samples from potentially few labelled examples(templates) is nearest neighbor retrieval, but this method has never been extensively tested on vocal segmentation tasks. We retrieve zebra finch vocalizations as neighbors of each other in the sound spectrogram space. Based on merely 50 templates, we find excellent retrieval performance in adults (F1 score of 0.93 +/- 0.07) but not in juveniles (F1 score of 0.64 +/- 0.18), presumably due to the larger vocal variability of the latter. The performance in juveniles improves when retrieval is based on fixed-size template slices (F1 score of 0.72 +/- 0.10) instead of entire templates. Among the several distance metrics we tested such as the cosine and the Euclidean distance, we find that the Spearman distance largely outperforms all others. We release our expert-curated dataset of more than 50{\textquoteright}000 zebra finch vocal segments, which will enable training of data-hungry machine-learning approaches.Competing Interest StatementThe authors have declared no competing interest.},
  URL = {https://www.biorxiv.org/content/early/2023/09/04/2023.09.04.555475},
  eprint = {https://www.biorxiv.org/content/early/2023/09/04/2023.09.04.555475.full.pdf},
  journal = {bioRxiv}
}
```

```
@article {Gu2023.09.30.560270,
  author = {Nianlong Gu and Kanghwi Lee and Maris Basha and Sumit Kumar Ram and Guanghao You and Richard Hahnloser},
  title = {Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection},
  elocation-id = {2023.09.30.560270},
  year = {2023},
  doi = {10.1101/2023.09.30.560270},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {This paper introduces WhisperSeg, utilizing the Whisper Transformer pre-trained for Automatic Speech Recognition (ASR) for human and animal Voice Activity Detection (VAD). Contrary to traditional methods that detect human voice or animal vocalizations from a short audio frame and rely on careful threshold selection, WhisperSeg processes entire spectrograms of long audio and generates plain text representations of onset, offset, and type of voice activity. Processing a longer audio context with a larger network greatly improves detection accuracy from few labeled examples. We further demonstrate a positive transfer of detection performance to new animal species, making our approach viable in the data-scarce multi-species setting.Competing Interest StatementThe authors have declared no competing interest.},
  URL = {https://www.biorxiv.org/content/early/2023/10/02/2023.09.30.560270},
  eprint = {https://www.biorxiv.org/content/early/2023/10/02/2023.09.30.560270.full.pdf},
  journal = {bioRxiv}
}
```

## Contact

nianlong.gu@uzh.ch

Provider:

nccratliri

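Summary of original information

Zebra finch vocalization dataset

Dataset overview

This is the zebra finch dataset customized for animal voice activity detection (vocal segmentation) in WhisperSeg. WhisperSeg takes the Whisper Transformer, which was pre-trained for Automatic Speech Recognition (ASR), and applies it to both human and animal Voice Activity Detection (VAD).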

Dataset download

from huggingface_hub import snapshot_download
snapshot_download("nccratliri/vad-zebra-finch", local_dir="data/zebra-finch", repo_type="dataset")

Citation information

When using this dataset, please cite the two bioRxiv preprints whose full BibTeX entries appear in the resource description above: Tomka et al. (2023), "Benchmarking nearest neighbor retrieval of zebra finch vocalizations across development", doi:10.1101/2023.09.04.555475; and Gu et al. (2023), "Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection", doi:10.1101/2023.09.30.560270.

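Collected summary

Dataset introduction

Construction

This zebra finch vocalization dataset is customized for animal voice activity detection (VAD) and comes from the WhisperSeg project. It builds on more than 50,000 expert-curated zebra finch vocal segments: vocalizations were extracted from raw microphone data streams and segmented and labeled with the help of nearest neighbor retrieval in the sound spectrogram space. Curation took the differences between adult and juvenile vocalizations into account, which supports the reliability and representativeness of the labels. The selected segments were then consolidated into a standardized format for training and evaluating machine-learning models.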

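Features

The dataset's main strength is its specialized, fine-grained annotation. It contains more than 50,000 vocal segments checked by domain experts, covering the varied vocalizations of adult and juvenile zebra finches across developmental stages. It provides onset and offset timestamps together with type labels, and it addresses the high vocal variability of juveniles, for which retrieval based on fixed-size template slices improves robustness. The dataset also integrates with the WhisperSeg framework, which processes entire spectrograms of long audio and achieves strong detection performance from few labeled examples.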
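To make the idea of fixed-size template slices concrete, here is a minimal sketch that cuts a sequence of spectrogram frames into equal-length, possibly overlapping slices. The function name `fixed_size_slices` and the `slice_len` and `hop` parameters are hypothetical and chosen for illustration only; the slice length and overlap actually used by the dataset authors are not stated here.

```python
def fixed_size_slices(frames, slice_len, hop):
    """Split a sequence of spectrogram frames into slices of `slice_len` frames.

    Consecutive slices start `hop` frames apart; a trailing remainder that is
    shorter than `slice_len` is dropped. Both parameters are illustrative
    assumptions, not values taken from the dataset or its papers.
    """
    if slice_len <= 0 or hop <= 0:
        raise ValueError("slice_len and hop must be positive")
    last_start = len(frames) - slice_len
    return [frames[start:start + slice_len] for start in range(0, last_start + 1, hop)]


# Ten dummy "frames" (integers stand in for spectrogram columns).
template = list(range(10))
slices = fixed_size_slices(template, slice_len=4, hop=3)
```

With a template of ten frames, `slice_len=4` and `hop=3`, the sketch yields three slices starting at frames 0, 3 and 6. A template shorter than `slice_len` yields no slices.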

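Usage

To use the dataset, download it to a local directory with the Hugging Face `snapshot_download` function, for example with `local_dir = "data/zebra-finch"`. The dataset plugs into the WhisperSeg framework; follow the guide in its GitHub repository (https://github.com/nianlonggu/WhisperSeg) for model training or evaluation. A typical workflow loads the audio and the corresponding labels, fine-tunes from the pre-trained Whisper Transformer weights, and generates a text representation of the onset, offset, and type of voice activity. When citing the dataset, include the citations of the related papers.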
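The download step can be wrapped in a small script, sketched below. `snapshot_download` and its `local_dir` and `repo_type` arguments come from the dataset card; the helper names `download_dataset` and `count_files_by_extension` are hypothetical. Because the file layout of the dataset is not described here, the second helper makes no assumption about it and only reports which file extensions it finds.

```python
from collections import Counter
from pathlib import Path


def download_dataset(local_dir="data/zebra-finch"):
    """Download the dataset snapshot; needs network access and `huggingface_hub`."""
    # Imported lazily so the helper below also works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        "nccratliri/vad-zebra-finch",
        local_dir=local_dir,
        repo_type="dataset",
    )


def count_files_by_extension(local_dir):
    """Count the files under `local_dir` (recursively), keyed by lowercase suffix."""
    counts = Counter()
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return dict(counts)


# Typical use (commented out because it requires network access):
# download_dataset("data/zebra-finch")
# print(count_files_by_extension("data/zebra-finch"))
```

Inspecting the extension counts first is a cheap way to confirm that the download completed before handing the directory to the WhisperSeg training scripts.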

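Background and challenges

Background overview

Where bioacoustics and speech processing overlap, voice activity detection (VAD) has long been limited by poor generalization across species. The nccratliri/vad-zebra-finch dataset, released in 2023 by a research team at the University of Zurich and ETH Zurich, is intended to advance research on animal vocalization detection with pre-trained Transformer models. It was built by Nianlong Gu, Kanghwi Lee, and colleagues, and its central research question is whether the Whisper speech Transformer transfers positively to human and animal VAD. With more than 50,000 precisely labeled zebra finch vocal segments, it serves as a benchmark resource for WhisperSeg, a framework that processes spectrograms of long audio and generates text describing the onset, offset, and type of each vocalization. Its contribution is to move beyond traditional detection on short audio frames with careful threshold selection, offering a new approach to multi-species acoustic monitoring where labeled data are scarce.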

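Current challenges

The challenges are twofold. At the domain level, animal voice activity detection must cope with the high vocal variability of juveniles, which makes detection difficult (F1 score of only 0.64 +/- 0.18 with entire templates, against 0.93 +/- 0.07 in adults), and with the scarcity of labeled samples when transferring across species. Traditional methods rely on hand-crafted features and empirical thresholds, and adapt poorly to the changing acoustic parameters of developing zebra finches. At the construction level, researchers must reliably separate vocal segments from background noise in raw microphone data streams, and must handle the performance gap between matching entire templates and matching fixed-size template slices (the latter raises the juvenile F1 score to 0.72 +/- 0.10). Selecting and validating a distance metric (the Spearman distance outperformed the cosine and Euclidean distances) and ensuring expert-level annotation accuracy were further key technical challenges.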
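The F1 scores quoted above summarize how well detected segments agree with reference segments. The sketch below shows one simple way to compute a segment-level F1 score. The matching rule (a predicted segment counts as correct when both its onset and its offset lie within `tolerance` seconds of an unmatched reference segment) and the default tolerance are assumptions made for illustration; they are not necessarily the criteria used in the cited papers.

```python
def segment_f1(predicted, reference, tolerance=0.01):
    """Return (precision, recall, f1) for lists of (onset, offset) pairs in seconds.

    Greedy one-to-one matching: each reference segment can be matched at most
    once. The matching rule and `tolerance` are illustrative assumptions.
    """
    unmatched = list(reference)
    true_positives = 0
    for p_on, p_off in predicted:
        for ref in unmatched:
            r_on, r_off = ref
            if abs(p_on - r_on) <= tolerance and abs(p_off - r_off) <= tolerance:
                unmatched.remove(ref)  # safe: we leave the inner loop right away
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = [(0.10, 0.20), (0.50, 0.65), (1.00, 1.10)]
predicted = [(0.101, 0.199), (0.50, 0.80), (1.005, 1.095), (2.00, 2.10)]
precision, recall, f1 = segment_f1(predicted, reference)
```

In this toy example two of the four predictions match a reference segment, so precision is 2/4, recall is 2/3, and the F1 score is 4/7 (about 0.57).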

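Common scenarios

Classic use cases

The dataset is designed for voice activity detection (VAD) in zebra finches and is the standard training and evaluation resource for the WhisperSeg model in the analysis of animal acoustic behavior. Through transfer learning on the Whisper Transformer architecture, researchers can use it to segment and label zebra finch vocalizations in raw audio streams. A typical application converts long microphone recordings into a structured text description of vocalization onsets, offsets, and types, replacing traditional detection based on short audio frames and threshold selection and improving robustness when few labeled examples are available.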

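Practical applications

In practice, the dataset can support automated animal behavior monitoring systems, for example tracking the calls and social interactions of wild birds without disturbing them in ecological studies. In experimental neuroscience it can help analyze the neural mechanisms of vocal learning in songbirds, where accurate detection reduces the time cost and subjective bias of manual annotation. The same technical framework could extend to biodiversity monitoring, pet behavior analysis, and acoustic population surveys in wildlife conservation, providing a data foundation for non-invasive, scalable acoustic observation of animals.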

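Derived and related work

The dataset is tied directly to the WhisperSeg model, which demonstrated positive transfer of the Whisper speech Transformer to animal VAD; the paper is available as a bioRxiv preprint. The zebra finch data also underlie a benchmark of nearest neighbor retrieval of zebra finch vocalizations, which evaluated several distance metrics (including the Spearman distance) for matching vocal segments and compared retrieval performance between adults and juveniles. Together, these works mark a shift in animal bioacoustics from traditional signal processing toward transfer learning with large pre-trained models.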
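The retrieval benchmark mentioned above compares vocalizations with distance metrics such as the Spearman distance. The sketch below illustrates nearest neighbor retrieval under the common convention that the Spearman distance is one minus the Spearman rank correlation; that definition, the helper names, and the toy vectors (standing in for flattened spectrogram slices) are assumptions for illustration, not the benchmark's implementation.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_distance(x, y):
    """Assumed definition: 1 - Spearman rank correlation, so values lie in [0, 2]."""
    return 1.0 - pearson(ranks(x), ranks(y))


def nearest_template(query, templates):
    """Return (index, distance) of the template closest to `query`."""
    distances = [spearman_distance(query, t) for t in templates]
    best = min(range(len(templates)), key=lambda i: distances[i])
    return best, distances[best]


templates = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [1, 3, 2, 5, 4]]
query = [10, 20, 30, 45, 50]
best_index, best_distance = nearest_template(query, templates)
```

Because the Spearman distance depends only on rank order, the query matches the first template with a distance of (numerically) zero even though the raw values differ in scale.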

The content above was collected and summarized by 遇见数据集.
