Codecfake dataset - development set
Source: NIAID Data Ecosystem. Indexed 2026-05-02.
Download link:
https://zenodo.org/record/11169871
Description:
This dataset is the development set of the Codecfake dataset, corresponding to the manuscript "The Codecfake Dataset and Countermeasures for Universal Deepfake Audio Detection".
Abstract
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALMs directly use neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from neural codec to waveform. We first construct the Codecfake dataset, an open-source large-scale dataset that includes two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original SAM, we propose the CSAM strategy to learn a domain-balanced and generalized minimum. Experimental results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
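The abstract reports results as an Equal Error Rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER could be computed from detector scores. It is not the authors' evaluation script: the function name `compute_eer` and the convention that higher scores mean "more likely bona fide" are assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for two sets of detector scores.

    Assumption: a higher score means the detector considers the
    sample more likely to be bona fide (real) audio.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide samples scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed samples scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # The EER lies where the two error rates cross; take the closest
    # threshold and average the two rates at that point.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


if __name__ == "__main__":
    eer, thr = compute_eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.7])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

This sketch scans the observed scores only, so on small score sets the result is an approximation of the true crossing point; evaluation toolkits often interpolate between thresholds instead.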
Codecfake Dataset
Due to platform restrictions on the size of Zenodo repositories, we have divided the Codecfake dataset into several subsets, as shown in the table below:
Codecfake dataset | Description | Link
training set (part 1 of 3) & label | train_split.zip & train_split.z01 - train_split.z05 | https://zenodo.org/records/13838106
training set (part 2 of 3) | train_split.z06 - train_split.z10 | https://zenodo.org/records/13841652
training set (part 3 of 3) | train_split.z11 - train_split.z16 | https://zenodo.org/records/13853860
development set | dev_split.zip & dev_split.z01 - dev_split.z02 | https://zenodo.org/records/13841216
test set (part 1 of 2) | Codec test: C1.zip - C6.zip & ALM test: A1.zip - A3.zip | https://zenodo.org/records/13838823
test set (part 2 of 2) | Codec unseen test: C7.zip | https://zenodo.org/records/11125029
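Because the training and development sets are spread over several archive segments and several Zenodo records, it can help to confirm that every segment is present before extracting. The sketch below lists the segment names implied by the table and reports which ones are absent from a local directory. The helper names `expected_parts` and `missing_parts` and the `SPLIT_PARTS` mapping are hypothetical, introduced only for this illustration; the part counts are copied from the table.

```python
from pathlib import Path

# Highest numbered segment per split, taken from the table above:
# N means segments .z01 through .zN accompany the final .zip file.
SPLIT_PARTS = {"train_split": 16, "dev_split": 2}


def expected_parts(stem, last_part):
    """List the segment file names for one split archive."""
    names = [f"{stem}.z{i:02d}" for i in range(1, last_part + 1)]
    names.append(f"{stem}.zip")
    return names


def missing_parts(directory, stem):
    """Return the expected segments that are not files in `directory`."""
    directory = Path(directory)
    return [
        name
        for name in expected_parts(stem, SPLIT_PARTS[stem])
        if not (directory / name).is_file()
    ]


if __name__ == "__main__":
    for stem in SPLIT_PARTS:
        absent = missing_parts(".", stem)
        status = "complete" if not absent else f"missing {absent}"
        print(f"{stem}: {status}")
```

The `.z01`, `.z02`, ... naming suggests a split zip archive, so all segments of one split presumably need to sit in the same directory before extraction. Python's standard `zipfile` module does not handle multi-disk archives, so extraction would likely need an external tool that supports split zips.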
Countermeasure
The source code of the countermeasure and the pre-trained model are available on GitHub: https://github.com/xieyuankun/Codecfake
The Codecfake dataset and pre-trained model are licensed under CC BY-NC-ND 4.0.
Created:
2024-09-28



