Codecfake dataset - training set (part 2 of 3)
Source: NIAID Data Ecosystem (indexed 2026-05-02)
Download link:
https://zenodo.org/record/11171719
Description:
This dataset is the training set (part 2 of 3) of the Codecfake dataset, corresponding to the manuscript "The Codecfake Dataset and Countermeasures for Universal Deepfake Audio Detection".
Abstract
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALMs directly use neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source large-scale dataset that includes two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minima. Experimental results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
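The abstract reports results as an Equal Error Rate (EER), the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of that metric, not the authors' evaluation script. The function name `compute_eer` and the convention that higher scores mean "more likely bona fide" are assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold).

    Assumption: a higher score means the detector considers the
    utterance more likely to be bona fide.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer, best_thr = float("inf"), 1.0, thresholds[0]

    for thr in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # With finite data the two rates rarely match exactly,
            # so report their midpoint at the closest crossing.
            best_gap, eer, best_thr = gap, (far + frr) / 2, thr

    return eer, best_thr
```

This brute-force sweep is quadratic in the number of scores; for evaluation sets of the size described here, a sorted, cumulative-count implementation would be the more practical choice.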
Codecfake Dataset
Due to platform restrictions on the size of Zenodo repositories, we have divided the Codecfake dataset into several subsets, as shown in the table below:

| Codecfake dataset | Description | Link |
|---|---|---|
| training set (part 1 of 3) & label | train_split.zip & train_split.z01 - train_split.z05 | https://zenodo.org/records/13838106 |
| training set (part 2 of 3) | train_split.z06 - train_split.z10 | https://zenodo.org/records/13841652 |
| training set (part 3 of 3) | train_split.z11 - train_split.z16 | https://zenodo.org/records/13853860 |
| development set | dev_split.zip & dev_split.z01 - dev_split.z02 | https://zenodo.org/records/13841216 |
| test set (part 1 of 2) | Codec test: C1.zip - C6.zip & ALM test: A1.zip - A3.zip | https://zenodo.org/records/13838823 |
| test set (part 2 of 2) | Codec unseen test: C7.zip | https://zenodo.org/records/11125029 |
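Because the training and development sets are spread over several records, all parts of a split must sit in one directory before extraction. The sketch below checks that every expected part is present and builds a merge command. It assumes the `.zip`/`.z01`... files form a standard split zip archive that Info-ZIP's `zip -s 0 ... --out ...` can rejoin; the source does not state which tool produced them. The helper names and the `_merged.zip` output name are hypothetical.

```python
from pathlib import Path


def expected_parts(stem, last_index):
    """List the file names of a split archive: stem.zip plus stem.z01..stem.zNN."""
    parts = [f"{stem}.z{i:02d}" for i in range(1, last_index + 1)]
    return [f"{stem}.zip"] + parts


def missing_parts(directory, stem, last_index):
    """Return the expected part names that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in expected_parts(stem, last_index)
            if not (folder / name).is_file()]


def merge_command(stem):
    """Build (but do not run) a command that joins the parts into one archive.

    Assumption: the parts are a standard split zip archive, so Info-ZIP's
    `zip -s 0` can rewrite them as a single file.
    """
    return ["zip", "-s", "0", f"{stem}.zip", "--out", f"{stem}_merged.zip"]
```

For the training set, `missing_parts("data", "train_split", 16)` should return an empty list once all three training records have been downloaded into `data`; the development set uses `("dev_split", 2)`. The command from `merge_command` can then be passed to `subprocess.run(..., cwd="data", check=True)`, after which the merged archive is unzipped as usual.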
Countermeasure
The source code of the countermeasure and the pre-trained model are available on GitHub: https://github.com/xieyuankun/Codecfake
The Codecfake dataset and pre-trained model are licensed under CC BY-NC-ND 4.0.
Created:
2024-09-28



