CodecFake-Omni | Deepfake Speech Detection Dataset | Codec Dataset

Source: arXiv · Updated 2025-01-15 · Indexed 2025-01-16
Tags: deepfake speech detection, codec
Download link:
http://arxiv.org/abs/2501.08238v1
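Resource overview:
CodecFake-Omni is a large-scale dataset created by National Taiwan University and collaborating institutions for studying the detection of deepfake speech produced with neural codecs. Its training data is generated by 31 open-source codec models, and its test data comes from 17 advanced codec-based speech generation (CoSG) models. The training data is produced by re-synthesizing real speech; the test data consists of speech generated by models that have not been publicly released. CodecFake-Omni is currently the largest CodecFake corpus and covers the widest range of codec architectures. Its main application is deepfake speech detection, where it targets the weakness of current anti-spoofing models on synthetic speech generated by CoSG systems.
Provided by:
National Taiwan University
Created:
2025-01-15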
AI-compiled summary
Dataset introduction
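Construction
CodecFake-Omni is built on neural audio codec technology and is intended to advance research on detecting codec-based deepfake speech (CodecFake). The training set is generated by re-synthesizing real speech with 31 open-source neural audio codec models drawn from 21 codec families. The evaluation set contains data collected from the web for 17 advanced codec-based speech generation (CoSG) models, spanning 8 codec families. With a dataset of this scale, the authors demonstrate the limitations of conventional anti-spoofing models on speech generated by modern codecs, and they propose a comprehensive taxonomy of neural audio codecs that offers useful guidance for future CodecFake detection research.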
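The re-synthesis step can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: `toy_quantizer` is a stand-in for a real neural codec (it only rounds samples to a grid), and the record fields `utt_id`, `codec`, `label`, and `audio` are hypothetical rather than the dataset's actual schema. The sketch also assumes that re-synthesized utterances are labeled as spoofed and the untouched originals as bona fide.

```python
from typing import Callable, Dict, List, Optional, TypedDict

Waveform = List[float]
# Hypothetical stand-in for a codec: encode a waveform, then decode it again.
Resynthesizer = Callable[[Waveform], Waveform]


class Example(TypedDict):
    utt_id: str
    codec: Optional[str]  # None for untouched real speech
    label: str            # "bonafide" or "spoof" (assumed labeling)
    audio: Waveform


def toy_quantizer(levels: int) -> Resynthesizer:
    """Toy 'codec' for illustration only: quantize samples, then map them back."""
    def resynthesize(wave: Waveform) -> Waveform:
        codes = [round(x * levels) for x in wave]  # "encode" to integer codes
        return [c / levels for c in codes]         # "decode" back to samples
    return resynthesize


def build_training_set(
    real_speech: Dict[str, Waveform],
    codecs: Dict[str, Resynthesizer],
) -> List[Example]:
    """Pair every real utterance with one re-synthesized copy per codec."""
    examples: List[Example] = []
    for utt_id, wave in real_speech.items():
        examples.append(
            {"utt_id": utt_id, "codec": None, "label": "bonafide", "audio": wave}
        )
        for name, resynthesize in codecs.items():
            examples.append(
                {"utt_id": utt_id, "codec": name, "label": "spoof",
                 "audio": resynthesize(wave)}
            )
    return examples
```

With two real utterances and two toy codecs, `build_training_set` returns six records: two bona fide and four re-synthesized. In practice each `Resynthesizer` would wrap a real codec model's encode and decode calls, which this sketch deliberately does not model.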
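Features
CodecFake-Omni is the largest known CodecFake speech dataset and covers the widest range of codec architectures. Its training set is generated with 31 open-source codec models, and its evaluation set contains speech from 17 CoSG models. Beyond its scale, the dataset uses a codec taxonomy to support systematic, stratified analysis of codecs, which reveals how codec attributes relate to CodecFake detection performance. For example, data re-synthesized with codecs that have a disentanglement auxiliary objective leads to better performance when detecting CodecFake speech.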
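Usage
Using CodecFake-Omni involves two stages: training and evaluation. In the training stage, researchers train anti-spoofing models on the dataset's codec re-synthesized speech (CoRS). The evaluation stage has two parts: evaluation on re-synthesized speech, and evaluation on fake speech generated by CoSG models. This staged evaluation lets researchers test a model across different conditions. The dataset also supports stratified analysis based on the codec taxonomy, which helps researchers understand how codec attributes affect detection performance and so guides the development of anti-spoofing models.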
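The stratified evaluation described above could be sketched as follows. This is a hedged illustration, not the authors' evaluation code: it assumes equal error rate (EER) as the metric, which is common in anti-spoofing work but is not stated in this listing, and the trial fields `label`, `score`, and `codec_family` are hypothetical names. Scores are assumed to be higher for speech the model considers genuine.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def equal_error_rate(bonafide_scores: Sequence[float],
                     spoof_scores: Sequence[float]) -> float:
    """Approximate EER by trying every observed score as a decision threshold."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)  # a threshold that rejects everything
    best_gap = None
    eer = 0.0
    for t in thresholds:
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def eer_by_stratum(trials: List[dict], attribute: str) -> Dict[str, float]:
    """Compute one EER per value of a taxonomy attribute (hypothetical field)."""
    bonafide = [t["score"] for t in trials if t["label"] == "bonafide"]
    groups: Dict[str, List[float]] = defaultdict(list)
    for t in trials:
        if t["label"] == "spoof":
            groups[t[attribute]].append(t["score"])
    # All strata share the same bona fide trials; only the spoofed side changes.
    return {value: equal_error_rate(bonafide, scores)
            for value, scores in groups.items()}
```

Calling `eer_by_stratum(trials, "codec_family")` on a list of scored trials yields one error rate per codec family, which is one way to see which codec attributes a detector handles poorly. The same function works for any other taxonomy attribute stored on the trials.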
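Background and challenges
Background
CodecFake-Omni was created in 2025 by a research team at National Taiwan University to address the detection of deepfake speech (CodecFake) generated by codec-based speech generation (CoSG) systems. As CoSG systems advance rapidly, producing realistic fake speech is becoming easier, which poses a serious threat to information security and public trust. CodecFake-Omni is currently the largest dataset of its kind and covers the widest range of codec architectures, with training data generated by 31 open-source neural audio codec models and test data generated by 17 advanced CoSG models. Its release supports progress in anti-spoofing research, particularly in detecting new kinds of deepfake speech.
Current challenges
The challenges around CodecFake-Omni fall into two areas. First, on the problem side, conventional anti-spoofing models struggle to detect fake speech generated by CoSG systems, because this speech differs markedly in its acoustic characteristics from speech produced by traditional speech synthesis models. Second, on the construction side, the diversity of codec models adds complexity: to build a comprehensive dataset, the researchers had to integrate many codec architectures and define a systematic taxonomy of neural audio codecs in order to understand and analyze these models. Collecting the test data was also constrained by privacy concerns and by models that have not been released, so the researchers could only obtain data from public demo pages, which made the dataset harder to build.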
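Common scenarios
Typical use cases
CodecFake-Omni is mainly used for research on detecting deepfake speech based on neural audio codecs (CodecFake). By re-synthesizing speech with many neural audio codec architectures, it provides extensive training and test data for developing anti-spoofing models. Its typical use is training and evaluating anti-spoofing models to detect CodecFake speech, with the aim of improving detection performance on emerging neural-codec-based generation systems.
Derived and related work
According to this summary, CodecFake-Omni has led to a series of related studies, particularly in deepfake speech detection. Research built on the dataset is said to have proposed several new anti-spoofing models, such as models based on Vocos and FACodec, which reportedly perform well at detecting CodecFake speech. The dataset has also advanced work on the neural audio codec taxonomy, providing a systematic analysis framework for future deepfake detection. Related studies have further examined how different codec attributes affect detection performance.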
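Recent research on the dataset
Latest research directions
With the rapid development of codec-based speech generation (CoSG) systems, it has become very easy to generate fake speech that imitates a person's identity and to spread misinformation. The release of CodecFake-Omni marks an important step in detecting deepfake speech generated by CoSG systems (CodecFake). It is the largest deepfake speech detection dataset to date and covers the widest range of codec architectures, with training data generated by 31 open-source neural audio codec models and test data generated by 17 advanced CoSG models. Using this dataset, the authors show that conventional anti-spoofing models perform poorly on synthetic speech from current CoSG systems, and they propose a comprehensive taxonomy of neural audio codecs that offers useful guidance for future CodecFake detection research. CodecFake-Omni provides a substantial resource for developing anti-spoofing models targeted at CodecFake, and its stratified analysis reveals how codec attributes relate to detection performance.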
Related research papers
  • 1
    CodecFake-Omni: A Large-Scale Codec-based Deepfake Speech Dataset. National Taiwan University · 2025
The content above was collected and summarized by AI.