250713 Enhance publishing丨Speaker Verification Based on Tide-Ripple Convolution Neural Network
Source: DataCite Commons | Updated: 2025-12-17 | Indexed: 2026-05-05
Download link:
https://www.scidb.cn/detail?dataSetId=a232c98b082941c58002958208ef3f43
Description:
Citation notice: If you use this open source content in papers, books, academic reports, or other works, please cite the following (the original link has the latest citation format):
CHEN Chen, YI Zhixin, LI Dongyuan, CHEN Deyun. Speaker Verification Based on Tide-Ripple Convolution Neural Network[J]. Journal of Electronics & Information Technology. doi: 10.11999/JEIT250713

Authors: CHEN Chen, YI Zhixin, LI Dongyuan, CHEN Deyun
Affiliation: College of Computer Science and Technology, Harbin University of Science and Technology
DOI: 10.11999/JEIT250713
Original: https://jeit.ac.cn/cn/article/doi/10.11999/JEIT250713
Corresponding author: YI Zhixin, new961102@163.com
Open source date: December 10, 2025
Funds: National Natural Science Foundation of China (62101163), Heilongjiang Provincial Natural Science Foundation (YQ2024F018), Key Research and Development Project of Heilongjiang Province (JD2023SJ20)
Open source content: 1. Speaker Verification Based on Tidal Convolution Neural Network (reproduction code)

Abstract:

Objective: State-of-the-art speaker verification models often achieve high performance by using fixed receptive fields, at the expense of significant parameter counts and computational loads. Given the rich, multi-layered nature of speech, employing dynamic receptive fields to capture complex information remains relatively unexplored, with little intuition on what constitutes an effective design.

Methods: Inspired by the non-linear coupling behavior of a tidal surge, this study proposes Tide-Ripple Convolution (TR-Conv) to create a more "effective receptive field". TR-Conv first constructs primary and auxiliary receptive fields within a window using power-of-two interpolation. It then employs a scan-pooling mechanism to extract key information from outside the window and an operator mechanism to perceive fine-grained differences within it. Fusing these three components yields a variable receptive field that is multi-scale, dynamic, and effective. A Tide-Ripple Convolutional Neural Network (TR-CNN) is established to validate this approach. To address label noise in datasets, a total loss function is proposed that fuses a None-Target with Dynamic Normalization (NTDN) loss with a weighted Sub-center AAM loss variant to enhance model performance.

Results and Discussions: The proposed TR-CNN is systematically validated on the VoxCeleb1-O/E/H benchmarks. Results confirm that TR-CNN achieves a superior balance of accuracy, computation, and parameter efficiency (Table 1). Compared to the strong ECAPA-TDNN baseline, the TR-CNN (C=512, n=1) model yields relative EER reductions of 4.95%, 4.03%, and 6.03%, and MinDCF reductions of 31.55%, 17.14%, and 17.42% across the three test sets, while using 32.7% fewer parameters and 23.5% less computation (Table 2). The optimal TR-CNN (C=1024, n=1) model sets a new performance benchmark, achieving EERs of 0.85%, 1.10%, and 2.05%. Robustness is enhanced by the total loss setup, which provides stable EER and MinDCF improvements during fine-tuning (Table 3). Further analysis, including ablation studies (Tables 5 and 6), component explorations (Fig. 3, Table 4), and t-SNE visualizations (Fig. 4), collectively validates the effectiveness and robustness of each module within the TR-CNN architecture.

Conclusions: This research proposes a simple and effective Tide-Ripple Convolution (TR-Conv) layer, built upon the T-RRF. Experiments demonstrate that this approach captures a more expressive and effective receptive field while significantly reducing both parameter count and computational cost. It outperforms traditional one-dimensional convolution in speech signal modeling and is both lightweight and scalable. Furthermore, a total loss function, comprising the NTDN loss and a Sub-center AAM loss variant, is introduced. It makes the speaker embeddings generated by the network more discriminative and robust, especially in the presence of mislabeled data. Looking ahead, TR-Conv holds promise as a general-purpose module for integration into deeper and more complex neural network architectures.

The attachment is the open source code of the authors' research results.
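The EER values and relative reductions reported in the abstract follow the usual speaker verification convention: the equal error rate is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of that arithmetic, not taken from the attached code. The function names compute_eer and relative_reduction, and all input values, are illustrative assumptions.

```python
import numpy as np


def compute_eer(scores, labels):
    """Approximate the equal error rate (EER) of a set of verification trials.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Assumes at least one trial of each kind and no tied scores.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Sort trials from highest to lowest score; accepting the top-k trials
    # corresponds to placing the threshold just below the k-th score.
    order = np.argsort(-scores)
    labels = labels[order]

    n_target = labels.sum()
    n_nontarget = len(labels) - n_target

    # False acceptance rate: accepted non-target trials / all non-target trials.
    far = np.cumsum(1 - labels) / n_nontarget
    # False rejection rate: rejected target trials / all target trials.
    frr = 1.0 - np.cumsum(labels) / n_target

    # Pick the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)


def relative_reduction(baseline, new):
    """Relative reduction of an error metric, in percent."""
    return (baseline - new) / baseline * 100.0


# Hypothetical toy trials (not from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer = compute_eer(toy_scores, toy_labels)
print(f"toy EER: {toy_eer:.2%}")
```

This sketch selects the nearest discrete threshold. Evaluation toolkits typically interpolate between thresholds and handle tied scores, so their values on real trial lists can differ slightly.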
Provider:
Science Data Bank
Created:
2025-12-10



