Spoof detection using voice contribution on LFCC features and ResNet-34
Source: DataCite Commons | Updated: 2024-09-13 | Indexed: 2025-04-16
Download link:
http://doi.nrct.go.th/?page=resolve_doi&resolve_doi=10.14457/TU.the.2023.635
Description:
Biometric authentication, and speaker verification in particular, has advanced considerably in recent years. Speaker verification systems nevertheless remain vulnerable to spoofing attacks, and each attack type calls for dedicated detection measures. This study addresses three attack types: replay, speech synthesis, and voice conversion. The proposed spoof detector uses linear frequency cepstral coefficients (LFCC) as front-end features and ResNet-34 as the classifier that discriminates genuine from spoofed speech. We evaluated the method on the ASVspoof 2019 dataset in two scenarios: Physical Access (PA), which covers replay attacks, and Logical Access (LA), which covers speech synthesis and voice conversion attacks. We compare extracting features from the entire utterance against extracting them from only a specific voice segment within the utterance. We also benchmark the method against the established baselines, linear frequency cepstral coefficients with a Gaussian mixture model (LFCC-GMM) and constant Q cepstral coefficients with a Gaussian mixture model (CQCC-GMM), and against contemporary state-of-the-art approaches. For replay attacks (PA), the proposed method achieves an equal error rate (EER) of 1.85% on the development set and 2.74% on the evaluation set. For voice conversion and speech synthesis attacks (LA), it attains an EER of 0.01% and 5.16% on the corresponding sets.
These results indicate that the method detects spoofing attacks effectively in both the PA and LA scenarios. We further assess the robustness and generalizability of the model through cross-dataset validation and an analysis of gender bias, which give additional insight into its performance and reliability in real-world settings.
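The LFCC front end named in the description can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the frame length, hop size, FFT size, filter count, and number of coefficients are assumed values chosen for illustration, and all function names are hypothetical. The sketch frames the signal, takes the power spectrum, applies triangular filters spaced linearly in frequency (the difference from mel-spaced MFCC), and applies a DCT to the log filterbank energies.

```python
import numpy as np


def linear_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are spaced linearly from 0 Hz to Nyquist."""
    edges = np.linspace(0.0, sample_rate / 2, n_filters + 2)
    freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        low, center, high = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - low) / (center - low)
        falling = (high - freqs) / (high - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T


def lfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=20, n_ceps=20):
    """Return an (n_frames, n_ceps) LFCC matrix. All defaults are assumptions."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    energies = power @ linear_filterbank(n_filters, n_fft, sample_rate).T
    # Small constant avoids log(0) on silent frames.
    return dct_ii(np.log(energies + 1e-10), n_ceps)
```

The resulting frames-by-coefficients matrix is the kind of two-dimensional input that an image-style classifier such as ResNet-34 could consume. Restricting `signal` to a chosen segment before calling `lfcc` would correspond to the segment-based variant the description mentions, although how that segment is selected is not specified here.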
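The equal error rate (EER) used to report results is the operating point at which the false acceptance rate (spoofed speech accepted as genuine) equals the false rejection rate (genuine speech rejected). The following is a minimal sketch of that computation, assuming that a higher score means "more likely genuine"; the function name is hypothetical and the scores in the usage example are toy values, not data from the study.

```python
import numpy as np


def compute_eer(genuine_scores, spoof_scores):
    """Return (eer, threshold), assuming higher scores indicate genuine speech."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, spoof]))
    # False rejection: genuine trials scored below the threshold.
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # False acceptance: spoofed trials scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # With finitely many trials the two rates rarely cross exactly,
    # so take the closest point and average the two rates there.
    best = np.argmin(np.abs(far - frr))
    return (far[best] + frr[best]) / 2, thresholds[best]


# Toy scores for illustration only.
eer, threshold = compute_eer([0.2, 0.6, 0.7, 0.9], [0.1, 0.3, 0.4, 0.8])
print(f"EER = {eer:.2%} at threshold {threshold}")
```

An EER of 2.74% therefore means that, at the threshold where the two error rates balance, about 2.74% of spoofed trials are accepted and about 2.74% of genuine trials are rejected.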
Provider:
Thammasat University
Created:
2024-09-13



