
Dax99993/habla-augmented

Hugging Face, updated 2026-02-25, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Dax99993/habla-augmented
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: country
    dtype: string
  - name: sex
    dtype: string
  - name: audio
    dtype: audio
  - name: model
    dtype: string
  - name: augmentation_algorithm
    dtype: int64
  - name: label
    dtype: string
  splits:
  - name: train
    num_bytes: 26655483684
    num_examples: 149930
  - name: validation
    num_bytes: 668359226
    num_examples: 3748
  - name: test
    num_bytes: 690328849
    num_examples: 3748
  download_size: 25035826228
  dataset_size: 28014171759
---

# HABLA-Augmented

## Overview

**HABLA-Augmented** is an augmented version of the [HABLA](https://zenodo.org/records/7370805) audio dataset. The original HABLA dataset contains spoof and bonafide samples across 5 Spanish dialects. This dataset differs from the original in how the subsets were obtained and in the data augmentation applied, which results in 4 subsets:

* Train (balanced & augmented; train split)
* Validation (balanced; validation split)
* Close-test (balanced; test split)
* Open-test (unbalanced, unseen speakers and synthesis models; provided as an independent dataset: [HABLA-Open-Test](https://huggingface.co/datasets/Dax99993/habla-open-test))

## Open-test

To obtain the open-test subset, the original set was split in two by speakers and synthesis models. First, 2 of the 6 available synthesis models were reserved exclusively for testing. Additionally, 20% of the speakers were selected randomly from each accent and sex. The combination of both subsets forms the open-test subset.

```markdown
| Model       | # Samples | % Set |
|-------------|-----------|-------|
| StarGAN     | 16000     | 46.4  |
| CycleGAN    | 6200      | 18    |
| Diff        | 4160      | 12    |
| bonafide    | 4075      | 11.85 |
| TTS-Diff    | 2378      | 6.9   |
| TTS         | 857       | 2.5   |
| TTS-StarGAN | 808       | 2.35  |
```

with the following proportion of samples per label:

```markdown
| Label    | # Samples | Fraction |
|----------|-----------|----------|
| bonafide | 4075      | 0.12     |
| spoof    | 30403     | 0.88     |
```

The remaining samples were combined and further processed to create the train, validation and close-test subsets.

## Under-sampling

Because HABLA contains more spoof samples than bonafide ones, the remaining set was under-sampled so that the number of spoof samples matches the number of bonafide samples, giving a balanced set. Since each synthesis model has a different number of samples, the models with the fewest samples were kept in full, and the models with more samples were sampled randomly to make up the rest of the bonafide sample count. This yields a total of 37482 samples with the following distribution:

```markdown
| Origin                | # Samples | % Set |
|-----------------------|-----------|-------|
| CycleGAN              | 6617      | 17.65 |
| Diff                  | 6617      | 17.65 |
| TTS (Microsoft Azure) | 3934      | 10.5  |
| TTS-StarGAN           | 1573      | 4.2   |
| Bonafide              | 18741     | 50    |
|-----------------------|-----------|-------|
| Total samples         | 37482     |       |
```

This set was split into train, validation and close-test subsets with proportions 0.8, 0.1 and 0.1 respectively, keeping the balanced proportion of bonafide and spoof samples.

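The under-sampling and split just described can be sketched in a few lines of Python. This is only an illustration of the procedure as worded in this card, not the author's script: the function names `spoof_quotas` and `stratified_split`, the random seed, and the pool sizes given for CycleGAN and Diff before under-sampling are all assumptions. The TTS, TTS-StarGAN and bonafide counts are the ones reported in the card.

```python
import random


def spoof_quotas(available, budget):
    """Pick how many spoof samples to keep per synthesis model.

    Models that hold no more than an equal share of the budget are kept in
    full; the leftover budget is shared evenly among the larger models.
    """
    quotas, remaining = {}, dict(available)
    while remaining:
        share = budget // len(remaining)
        small = [m for m, n in remaining.items() if n <= share]
        if not small:
            # Every remaining model can supply an equal share.
            extra = budget % len(remaining)
            for i, m in enumerate(sorted(remaining)):
                quotas[m] = share + (1 if i < extra else 0)
            break
        for m in small:
            quotas[m] = remaining.pop(m)
            budget -= quotas[m]
    return quotas


def stratified_split(items_by_label, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle each label separately, then cut validation, test and train."""
    rng = random.Random(seed)
    splits = {"train": [], "validation": [], "test": []}
    for items in items_by_label.values():
        items = list(items)
        rng.shuffle(items)
        n_val = int(len(items) * val_frac)
        n_test = int(len(items) * test_frac)
        splits["validation"] += items[:n_val]
        splits["test"] += items[n_val:n_val + n_test]
        splits["train"] += items[n_val + n_test:]
    return splits


# TTS and TTS-StarGAN counts are taken from this card.
# The CycleGAN and Diff pool sizes are HYPOTHETICAL placeholders.
available = {"CycleGAN": 20000, "Diff": 15000, "TTS": 3934, "TTS-StarGAN": 1573}
quotas = spoof_quotas(available, budget=18741)  # 18741 bonafide samples
print(quotas)

by_label = {
    "bonafide": [("bonafide", i) for i in range(18741)],
    "spoof": [("spoof", i) for i in range(sum(quotas.values()))],
}
splits = stratified_split(by_label)
print({name: len(rows) for name, rows in splits.items()})
```

Under these assumptions the two large models each contribute 6617 samples, and the split sizes come out as 29986, 3748 and 3748, which agrees with the figures reported for this dataset.
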
```markdown
| Subset        | # Samples | Fraction |
|---------------|-----------|----------|
| Train         | 29986     | 0.8      |
| Validation    | 3748      | 0.1      |
| Close-test    | 3748      | 0.1      |
|---------------|-----------|----------|
| Total samples | 37482     |          |
```

with the proportion of samples per label stated in the table:

```markdown
| Label    | # Samples | Fraction |
|----------|-----------|----------|
| bonafide | 18741     | 0.5      |
| spoof    | 18741     | 0.5      |
```

## Data augmentation

The [RawBoost](https://arxiv.org/pdf/2111.04433) data augmentation technique was applied to the train subset, using the following algorithms:

* (4) Series Convolutive-Impulsive-Stationary noise
* (5) Series Convolutive-Impulsive noise
* (6) Series Convolutive-Stationary noise
* (7) Series Impulsive-Stationary noise

This enlarged the train subset by a factor of 5, to a total of 149930 samples.

```markdown
| Source        | # Samples | Fraction |
|---------------|-----------|----------|
| Original      | 29986     | 0.2      |
| Algorithm-4   | 29986     | 0.2      |
| Algorithm-5   | 29986     | 0.2      |
| Algorithm-6   | 29986     | 0.2      |
| Algorithm-7   | 29986     | 0.2      |
|---------------|-----------|----------|
| Total samples | 149930    |          |
```

The RawBoost parameters were left at the defaults provided in the paper.

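To make the factor of 5 concrete, the sketch below pairs every train sample with every augmentation code and counts the result. It only models the bookkeeping: the actual RawBoost signal processing is not reproduced here, and the names `RAWBOOST_CODES` and `expand_train_manifest` are illustrative assumptions rather than part of any released script.

```python
from collections import Counter

# Codes as listed in this card; 0 marks the unmodified original audio.
RAWBOOST_CODES = {
    0: "No augmentation (original)",
    4: "Series Convolutive-Impulsive-Stationary noise",
    5: "Series Convolutive-Impulsive noise",
    6: "Series Convolutive-Stationary noise",
    7: "Series Impulsive-Stationary noise",
}


def expand_train_manifest(sample_ids, codes=RAWBOOST_CODES):
    """Pair every train sample with every augmentation code."""
    sample_ids = list(sample_ids)
    return [(sample_id, code) for code in codes for sample_id in sample_ids]


# 29986 is the size of the train subset before augmentation.
manifest = expand_train_manifest(range(29986))
per_code = Counter(code for _, code in manifest)

print(len(manifest))                # 149930
print(per_code[4] / len(manifest))  # 0.2
```

Each of the five sources contributes the same 29986 samples, which is why every row of the table holds a fraction of 0.2.
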
## Metadata

All subsets contain metadata to facilitate filtering and experimentation. The fields are the following:

* speaker_id: unique speaker ID
  * ex: 'arf_06136'
* country: speaker's accent
  * values: {Argentina, Chile, Colombia, Peru, Venezuela}
* sex: speaker's sex
  * values: {Female, Male}
* file_name: actual audio file name
  * ex: 'arf_00610_00006739039.wav'
* augmentation_algorithm: integer encoding the RawBoost augmentation algorithm used to augment the audio
  * values: {0, 4, 5, 6, 7}
  * notes: the encoding is
    * 0 -> No augmentation (original)
    * 4 -> Series Convolutive-Impulsive-Stationary noise
    * 5 -> Series Convolutive-Impulsive noise
    * 6 -> Series Convolutive-Stationary noise
    * 7 -> Series Impulsive-Stationary noise
* model: synthesis model used to generate the audio
  * values: {'CycleGAN', 'Diff', 'StarGAN', 'TTS', 'TTS-Diff', 'TTS-StarGAN', '-'}
  * notes: '-' is a placeholder used for bonafide samples
* label: category to which the audio belongs
  * values: {spoof, bonafide}
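As a small illustration of filtering on these fields, the sketch below builds a predicate from field/value pairs and applies it to plain dictionaries. The helper name `make_filter` and the three example rows are hypothetical; only the field names and allowed values come from the list above.

```python
def make_filter(**wanted):
    """Build a predicate that keeps rows matching every given field value."""
    def keep(example):
        return all(example[field] == value for field, value in wanted.items())
    return keep


# HYPOTHETICAL rows shaped like the metadata fields (audio omitted).
rows = [
    {"speaker_id": "arf_06136", "country": "Argentina", "sex": "Female",
     "model": "-", "augmentation_algorithm": 0, "label": "bonafide"},
    {"speaker_id": "arf_06136", "country": "Argentina", "sex": "Female",
     "model": "CycleGAN", "augmentation_algorithm": 4, "label": "spoof"},
    {"speaker_id": "arf_06136", "country": "Argentina", "sex": "Female",
     "model": "CycleGAN", "augmentation_algorithm": 0, "label": "spoof"},
]

# Keep only unaugmented spoof samples.
original_spoof = make_filter(label="spoof", augmentation_algorithm=0)
selected = [row for row in rows if original_spoof(row)]
print(len(selected))
```

A predicate of this shape should also be usable with the `filter` method of a Hugging Face `datasets.Dataset`, which accepts a function returning a boolean per example, though that path is not exercised here.
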
Provider:
Dax99993