Event localization prediction accuracy (%).
Source: Figshare | Updated: 2025-05-23 | Indexed: 2026-04-28
Download link:
https://figshare.com/articles/dataset/Event_localization_prediction_accuracy_/29138184
Description:
The growing reliance on video conferencing software brings significant benefits but also introduces challenges, particularly in managing audio quality. In multi-participant settings, ambient noise and interruptions can hinder speaker recognition and disrupt the flow of conversation. This work proposes an audio-visual source separation pipeline designed specifically for video conferencing and telepresence robot applications. The framework aims to isolate and enhance the speech of individual participants in noisy environments while enabling control over the volume of specific individuals captured in the video frame. The proposed pipeline comprises four key components: a deep learning-based feature extractor for audio and video, an audio-guided visual attention mechanism, a module for background noise suppression and human voice separation, and Deep Multi-Resolution Network (DMRN) modules. Human voice separation uses DPRNN-TasNet, a robust deep neural network framework. Experimental results demonstrate that the methodology effectively isolates and amplifies individual participants' speech, achieving a test accuracy of 71.88% on both the AVE and Music 21 datasets.
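The description does not say how the reported accuracy is computed. As a minimal sketch, assuming it is a segment-level classification accuracy (a common convention for event localization, not confirmed by this record), the percentage could be obtained as below. The function name, labels, and clip length are hypothetical and chosen only for illustration.

```python
def segment_accuracy(predicted, reference):
    """Return the percentage of segments whose predicted label matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("at least one segment is required")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


# Hypothetical labels for one clip split into ten segments (assumption, not dataset values).
reference = ["bg", "bg", "speech", "speech", "speech", "speech", "bg", "bg", "bg", "bg"]
predicted = ["bg", "speech", "speech", "speech", "speech", "bg", "bg", "bg", "bg", "bg"]

score = segment_accuracy(predicted, reference)
print(f"{score:.2f}%")
```

In practice the same ratio would be accumulated over every segment of every test clip rather than over a single clip.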
Created:
2025-05-23



