realnetworks-kontxt/arctic-hs

Name: realnetworks-kontxt/arctic-hs
Creator: realnetworks-kontxt
Published: 2024-12-19 09:56:19
License: No description provided

Hugging Face · Updated 2024-12-19 · Indexed 2024-06-11

Download link:

https://hf-mirror.com/datasets/realnetworks-kontxt/arctic-hs

Resource description:

---
license: cc-by-4.0
task_categories:
- audio-classification
language:
- en
tags:
- speech
- speech-classifiation
- text-to-speech
- spoofing
- accents
pretty_name: ARCTIC-HS
size_categories:
- 10K<n<100K
---

# ARCTIC-HS

An extension of the [CMU_ARCTIC](http://festvox.org/cmu_arctic/) and [L2-ARCTIC](https://psi.engr.tamu.edu/l2-arctic-corpus/) datasets for synthetic speech detection using text-to-speech, featured in the paper **Synthetic speech detection with Wav2Vec 2.0 in various language settings**. Specifically, the `symmetric` variants were used.

This dataset is 1 of 3 used in the paper, the others being:

- [FLEURS-HS](https://huggingface.co/datasets/realnetworks-kontxt/fleurs-hs)
  - the default train, dev and test sets
- [FLEURS-HS VITS](https://huggingface.co/datasets/realnetworks-kontxt/fleurs-hs-vits)
  - test set containing (generally) more difficult synthetic samples
  - separated due to different licensing

## Dataset Details

### Dataset Description

The dataset features 3 parts obtained from the 2 original datasets:

- CMU (native) non-US English speakers
- CMU (native) US English speakers
- L2 (non-native) English speakers

The original ARCTIC samples are used as `human` samples, while `synthetic` samples are generated using [Google Cloud Text-To-Speech](https://cloud.google.com/text-to-speech).
The resulting `symmetric` datasets feature exactly twice the samples of the original ones, but we also provide:

- human samples that couldn't be paired
  - 4 speakers in entirety we couldn't pair with a TTS voice
  - a small number of utterances unrelated to the A and B ARCTIC samples
- synthetic samples that couldn't be paired
  - mostly when a human speaker didn't read the B ARCTIC samples

- **Curated by:** [KONTXT by RealNetworks](https://realnetworks.com/kontxt)
- **Funded by:** [RealNetworks](https://realnetworks.com/)
- **Language(s) (NLP):** English
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) for the code, [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) for the dataset, however:
  - the human part of the dataset is under a **custom CMU license**
    - it should be compatible with **CC BY 4.0**
  - the human part of the L2 dataset is under **CC BY-NC 4.0**

### Dataset Sources

The original ARCTIC sets were downloaded from their original sources.

- **CMU_ARCTIC Repository:** [festvox.org](http://festvox.org/cmu_arctic/)
- **L2-ARCTIC Repository:** [tamu.edu](https://psi.engr.tamu.edu/l2-arctic-corpus/)
- **CMU_ARCTIC Paper:** [cmu.edu](https://www.cs.cmu.edu/~awb/papers/ssw5/arctic.pdf)
- **L2-ARCTIC Paper:** [tamu.edu](https://psi.engr.tamu.edu/wp-content/uploads/2018/08/zhao2018interspeech.pdf)
- **Paper:** Synthetic speech detection with Wav2Vec 2.0 in various language settings

## Uses

This dataset is best used as a test set for accents. Each sample contains an `Audio` feature and a label: `human` or `synthetic`.
### Direct Use

The following snippet of code demonstrates loading the CMU non-US English speaker part of the dataset:

```python
from datasets import load_dataset

arctic_hs = load_dataset(
    "realnetworks-kontxt/arctic-hs",
    "cmu_non-us",
    split="test",
    trust_remote_code=True,
)
```

To load a different part, change `cmu_non-us` into one of the following:

- `cmu_us` for CMU (native) US English speakers
- `l2` for L2 (non-native) English speakers

This dataset only has a `test` split.

To load only the paired samples, append `_symmetric` to the name. For example, `cmu_non-us` will load the test set also containing human and synthetic samples without their counterpart, while `cmu_non-us_symmetric` will only load samples where there is both a human and a synthetic variant. This is useful if you want perfectly balanced labels within speakers, and if you wish to exclude speakers for which there are no TTS counterparts at all. This is also the family of datasets used in the paper.

The `trust_remote_code=True` parameter is necessary because this dataset uses a custom loader. To see which code is being run, check the [loading script](./arctic-hs.py).

## Dataset Structure

The dataset data is contained in the [data directory](https://huggingface.co/datasets/realnetworks-kontxt/arctic-hs/tree/main/data).

There is 1 directory per part.

Within those directories, there are 2 further directories:

- `splits`
- `pairs`

Within the `splits` folder, there is 1 file per split:

- `test.tar.gz`

Those `.tar.gz` files contain 2 directories:

- `human`
- `synthetic`

Each of these directories contains `.wav` files. Keep in mind that these directories can't be merged as they share most of their file names. An identical file name implies a speaker-voice pair, e.g. `human/arctic_a0001.wav` and `synthetic/arctic_a0001.wav`.

The `pairs` folder contains a list of file names within each speaker, and whether or not there is a human-synthetic pair.
Based on that metadata we determine which samples appear in `symmetric` datasets.

Back to the part directories: each contains 2 metadata files, which are not used in the loaded dataset, but might be useful to researchers:

- `speaker-metadata.csv`
  - contains the speaker IDs paired with their speech properties
- `voice-metadata.csv`
  - contains speaker-TTS name pairs

Finally, the `data` root contains a single metadata file, `prompts.csv`, which, as the name suggests, contains the prompt transcripts. The only samples for which there are no transcripts are the ARCTIC-C ones, for which we couldn't find a source on the internet.

### Sample

A sample contains an Audio feature `audio` and a string `label`.

```
{
  'audio': {
    'path': 'ahw/human/arctic_a0001.wav',
    'array': array([0., 0., 0., ..., 0., 0., 0.]),
    'sampling_rate': 16000
  },
  'label': 'human'
}
```

## Citation

The dataset is featured alongside our paper, **Synthetic speech detection with Wav2Vec 2.0 in various language settings**, published at the IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW) 2024. The DOI is given in the BibTeX entry below.

**BibTeX:**

If you use this work, please cite us by including the following BibTeX reference:

```
@inproceedings{dropuljic-ssdww2v2ivls,
  author={Dropuljić, Branimir and Šuflaj, Miljenko and Jertec, Andrej and Obadić, Leo},
  booktitle={{IEEE} International Conference on Acoustics, Speech, and Signal Processing, {ICASSP} 2024 - Workshops, Seoul, Republic of Korea, April 14-19, 2024},
  title={Synthetic Speech Detection with Wav2vec 2.0 in Various Language Settings},
  year={2024},
  month={04},
  pages={585-589},
  publisher={{IEEE}},
  volume={},
  number={},
  keywords={Synthetic speech detection;text-to-speech;wav2vec 2.0;spoofing attack;multilingualism},
  url={https://doi.org/10.1109/ICASSPW62465.2024.10627750},
  doi={10.1109/ICASSPW62465.2024.10627750}
}
```

## Dataset Card Authors

- [Miljenko Šuflaj](https://huggingface.co/suflaj)

## Dataset Card Contact

- [Miljenko Šuflaj](mailto:msuflaj@realnetworks.com)

Provider:

realnetworks-kontxt

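Summary of Original Information

ARCTIC-HS Dataset Overview

Dataset Description

The ARCTIC-HS dataset is an extension of the CMU_ARCTIC and L2-ARCTIC datasets for synthetic speech detection. It is used in the paper Synthetic speech detection with Wav2Vec 2.0 in various language settings, which specifically uses the symmetric variants.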

Dataset Parts

CMU (native) non-US English speakers
CMU (native) US English speakers
L2 (non-native) English speakers

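The original ARCTIC samples serve as human samples, while the synthetic samples are generated with Google Cloud Text-To-Speech.

Dataset Features

Provides symmetric datasets with exactly twice as many samples as the original ones.
Also includes human and synthetic samples that could not be paired.

Dataset Sources

CMU_ARCTIC Repository: festvox.org
L2-ARCTIC Repository: tamu.edu

Dataset License

Code license: Apache 2.0
Dataset license: CC BY 4.0
- The human part of the dataset is under a custom CMU license
- The human part of the L2 dataset is under CC BY-NC 4.0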

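Dataset Structure

Data directory structure

There is one directory per part.
Each part directory contains the subdirectories splits and pairs.
The splits directory contains test.tar.gz, which in turn contains the subdirectories human and synthetic.
The pairs directory lists the file names for each speaker and whether a human-synthetic pair exists for each.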
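
The file-name convention described above can be used to recover human-synthetic pairs from an extracted archive. The following is a minimal sketch, assuming relative paths of the form `<speaker>/<label>/<file>.wav` (an assumption modeled on the sample path `ahw/human/arctic_a0001.wav`); the helper `find_pairs` is hypothetical and is not part of the dataset's loading script.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def find_pairs(paths):
    """Split relative .wav paths into paired and unpaired (speaker, file) keys.

    Assumes each path ends in "<speaker>/<label>/<file>.wav", where <label>
    is "human" or "synthetic" (an assumption, not a documented layout).
    """
    labels_seen = defaultdict(set)
    for path in paths:
        speaker, label, file_name = PurePosixPath(path).parts[-3:]
        labels_seen[(speaker, file_name)].add(label)

    both = {"human", "synthetic"}
    paired = sorted(key for key, labels in labels_seen.items() if labels == both)
    unpaired = sorted(key for key, labels in labels_seen.items() if labels != both)
    return paired, unpaired


example_paths = [
    "ahw/human/arctic_a0001.wav",
    "ahw/synthetic/arctic_a0001.wav",
    "ahw/human/arctic_b0001.wav",  # no synthetic counterpart in this toy list
]
paired, unpaired = find_pairs(example_paths)
print(paired)    # [('ahw', 'arctic_a0001.wav')]
print(unpaired)  # [('ahw', 'arctic_b0001.wav')]
```

Keying on both speaker and file name matters because the human and synthetic directories share most of their file names and therefore cannot simply be merged.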

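Metadata files

speaker-metadata.csv: contains the speaker IDs paired with their speech properties.
voice-metadata.csv: contains speaker-TTS name pairs.
prompts.csv: contains the prompt transcripts.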

Using the Dataset

Loading the dataset

```python
from datasets import load_dataset

arctic_hs = load_dataset(
    "realnetworks-kontxt/arctic-hs",
    "cmu_non-us",
    split="test",
    trust_remote_code=True,
)
```
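
The configuration names follow a simple pattern: a part name, optionally followed by `_symmetric`. The sketch below wraps that pattern in two hypothetical helpers, `config_name` and `load_part`, which are not part of the dataset or of the `datasets` library. `load_part` downloads data and runs the custom loading script, so it is defined here but not called. Script-based loading may not be available in every release of `datasets`, so a different release may be needed if `trust_remote_code` is rejected.

```python
PARTS = ("cmu_non-us", "cmu_us", "l2")


def config_name(part, symmetric=False):
    """Build a configuration name such as "cmu_non-us_symmetric"."""
    if part not in PARTS:
        raise ValueError(f"unknown part {part!r}; expected one of {PARTS}")
    return f"{part}_symmetric" if symmetric else part


def load_part(part, symmetric=False):
    """Load the test split of one part (downloads data, runs the loader)."""
    # Imported lazily so config_name works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "realnetworks-kontxt/arctic-hs",
        config_name(part, symmetric),
        split="test",  # the dataset only has a test split
        trust_remote_code=True,  # needed for the custom loading script
    )


print(config_name("l2", symmetric=True))  # l2_symmetric
```

For example, `load_part("cmu_us", symmetric=True)` would request only the paired samples of the CMU US English part.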

Dataset sample

```
{
  'audio': {
    'path': 'ahw/human/arctic_a0001.wav',
    'array': array([0., 0., 0., ..., 0., 0., 0.]),
    'sampling_rate': 16000
  },
  'label': 'human'
}
```
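
Given a sample with this structure, the clip duration follows from the array length and the sampling rate, and the string label can be mapped to an integer for a classifier. The sketch below uses a toy sample in which a plain list stands in for the NumPy array; the helper names and the 0/1 encoding are assumptions made for illustration.

```python
def duration_seconds(sample):
    """Length of the clip in seconds: number of samples / sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def label_to_int(label):
    """Map the string label to an integer (0 = human, 1 = synthetic).

    The numeric encoding is an arbitrary choice made for this sketch.
    """
    return {"human": 0, "synthetic": 1}[label]


# Toy sample with the same keys as a dataset sample.
toy_sample = {
    "audio": {
        "path": "ahw/human/arctic_a0001.wav",
        "array": [0.0] * 32000,
        "sampling_rate": 16000,
    },
    "label": "human",
}
print(duration_seconds(toy_sample))       # 2.0
print(label_to_int(toy_sample["label"]))  # 0
```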

Citation

```
@inproceedings{dropuljic-ssdww2v2ivls,
  author={Dropuljić, Branimir and Šuflaj, Miljenko and Jertec, Andrej and Obadić, Leo},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
  title={Synthetic speech detection with Wav2Vec 2.0 in various language settings},
  year={2024},
  volume={},
  number={},
  pages={585-589},
  keywords={Synthetic speech detection;text-to-speech;wav2vec 2.0;spoofing attack;multilingualism},
  doi={10.1109/ICASSPW62465.2024.10627750}
}
```

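Collected Summary

Dataset Introduction

Construction

ARCTIC-HS builds on two established speech corpora, CMU_ARCTIC and L2-ARCTIC, for the task of synthetic speech detection. Human speech samples are first taken from the original datasets, covering three groups of speakers: native non-US English, native US English, and non-native English. Google Cloud Text-To-Speech is then used to generate matching synthetic speech, forming human-synthetic pairs. The dataset provides symmetric variants in which every human sample has a synthetic counterpart, and it also keeps the samples that could not be paired.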

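Usage

Researchers can load the dataset from Hugging Face with the load_dataset function, specifying a configuration (cmu_non-us, cmu_us, or l2) to obtain the test set. The dataset contains only a test split and is suited to evaluating models across accents. For balanced within-speaker comparisons, use a symmetric variant (e.g. cmu_non-us_symmetric), which contains only successfully paired human and synthetic samples and excludes speakers without synthetic counterparts. Loading requires trust_remote_code=True so that the custom loading script can run. Each sample is an audio clip with a label and can be used directly for deep-learning-based speech classification.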
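
Because the symmetric variants are balanced between the two labels, per-label accuracy is one straightforward way to score a detector on them. The sketch below is an illustration only: it is not the evaluation protocol of the paper, and the predictions are toy values standing in for the output of a hypothetical detector.

```python
from collections import Counter


def per_label_accuracy(true_labels, predicted_labels):
    """Return accuracy per label and their unweighted mean (balanced accuracy)."""
    total = Counter(true_labels)
    correct = Counter(
        truth for truth, pred in zip(true_labels, predicted_labels) if truth == pred
    )
    per_label = {label: correct[label] / total[label] for label in total}
    balanced = sum(per_label.values()) / len(per_label)
    return per_label, balanced


# Toy labels standing in for dataset labels and a detector's outputs.
truth = ["human", "human", "synthetic", "synthetic"]
preds = ["human", "synthetic", "synthetic", "synthetic"]
per_label, balanced = per_label_accuracy(truth, preds)
print(per_label)  # {'human': 0.5, 'synthetic': 1.0}
print(balanced)   # 0.75
```

Reporting the two labels separately shows whether errors come from human speech flagged as synthetic or from synthetic speech that goes undetected.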

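Background and Challenges

Background Overview

As speech technology advances rapidly, synthetic speech detection has become an important research direction for securing speech systems. ARCTIC-HS was built in 2024 by the KONTXT team at RealNetworks as an extension of the CMU_ARCTIC and L2-ARCTIC datasets, specifically for synthetic speech detection. It combines speech from native and non-native English speakers with synthetic counterparts generated by Google Cloud Text-To-Speech, in order to study how accent affects the detection of synthetic speech. The central research question is how to improve the robustness and generalization of detection models across diverse language settings. The accompanying paper was published at the IEEE ICASSP 2024 Workshops, and the dataset serves as a benchmark resource for speech anti-spoofing.

Current Challenges

The domain challenge that ARCTIC-HS targets is the difficulty that accent diversity and speech naturalness pose for synthetic speech detection. The task is to distinguish high-quality synthetic speech produced by a modern text-to-speech system from real human speech; for non-native and regional accents in particular, subtle differences in acoustic features make classification harder. During construction, the team faced sample-pairing problems: some original speech samples have no synthetic counterpart, which leaves the data asymmetric. Combining the sources also required reconciling different licenses, namely the custom CMU license and CC BY-NC 4.0, so that the dataset remains compliant and usable.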

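Common Scenarios

Typical Use Cases

In synthetic speech detection, ARCTIC-HS provides a benchmark for evaluating model robustness across accents. The dataset combines the original speech samples of CMU_ARCTIC and L2-ARCTIC with synthetic speech generated by Google Cloud Text-To-Speech, yielding symmetric test sets that cover native and non-native English speakers. Its balanced human-synthetic pairs can be used to evaluate how well detection algorithms generalize across accents, including US English, non-US English, and second-language English.

Academic Problems Addressed

ARCTIC-HS addresses the problem of synthetic speech detection in speech security, particularly in multi-accent and cross-language settings. By providing human and synthetic samples across accent backgrounds, it helps researchers study defenses against speech spoofing attacks and assess how stable detection models are under realistic speech diversity. The symmetric variants keep the labels balanced, which gives a controlled setting for studying the effect of accent on synthetic speech detection.

Practical Applications

In practice, ARCTIC-HS can support the development and evaluation of voice authentication systems, phone-fraud detection tools, and security modules for voice assistants. Detectors evaluated on it are meant to identify fraudulent speech generated by text-to-speech, helping protect financial, customer-service, and IoT applications from voice spoofing attacks. In services with a global user base, systems must handle speech with many accents, and the multi-accent test data in this dataset helps check that a detection approach stays reliable under those conditions.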

Recent Research on the Dataset