Dax99993/habla-augmented
Source: Hugging Face. Updated 2026-02-25, indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/Dax99993/habla-augmented
Resource description:
---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: country
    dtype: string
  - name: sex
    dtype: string
  - name: audio
    dtype: audio
  - name: model
    dtype: string
  - name: augmentation_algorithm
    dtype: int64
  - name: label
    dtype: string
  splits:
  - name: train
    num_bytes: 26655483684
    num_examples: 149930
  - name: validation
    num_bytes: 668359226
    num_examples: 3748
  - name: test
    num_bytes: 690328849
    num_examples: 3748
  download_size: 25035826228
  dataset_size: 28014171759
---
# HABLA-Augmented
## Overview
**HABLA-Augmented** is an augmented version of the [HABLA](https://zenodo.org/records/7370805) audio dataset.
The original HABLA dataset contains spoof and bonafide samples across 5 Spanish dialects.
This dataset differs from the original in how the subsets were obtained and in the use of data augmentation,
which results in 4 subsets:
* Train (balanced and augmented; train split)
* Validation (balanced; validation split)
* Close-test (balanced; test split)
* Open-test (unbalanced, with unseen speakers and synthesis models; provided as an independent dataset, [HABLA-Open-Test](https://huggingface.co/datasets/Dax99993/habla-open-test))
## Open-test
To obtain the open-test subset, the original set was split by speakers and by synthesis models.
First, 2 of the 6 available synthesis models were reserved exclusively for testing.
Additionally, 20% of the speakers were selected at random from each accent and sex. Combining both selections produced the open-test subset, with the following distribution by model:
```markdown
| Model       | # Samples | % Set |
|-------------|-----------|-------|
| StarGAN     | 16000     | 46.4  |
| CycleGAN    | 6200      | 18    |
| Diff        | 4160      | 12    |
| bonafide    | 4075      | 11.85 |
| TTS-Diff    | 2378      | 6.9   |
| TTS         | 857       | 2.5   |
| TTS-StarGAN | 808       | 2.35  |
```
The label distribution of the open-test subset is:
```markdown
| Label    | # Samples | % Set |
|----------|-----------|-------|
| bonafide | 4075      | 12    |
| spoof    | 30403     | 88    |
```
The remaining samples were combined and further processed to create the train, validation and close-test subsets.
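The open-test procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' script: the function name `open_test_split` is hypothetical, records are assumed to be dictionaries carrying the `speaker_id`, `country`, `sex` and `model` metadata fields, and "combination" is read as a union (a sample goes to open-test if it comes from a reserved model or from a reserved speaker). The tables suggest that the reserved models are StarGAN and TTS-Diff, since they appear only in the open-test subset.

```python
import random
from collections import defaultdict


def open_test_split(records, held_out_models, speaker_fraction=0.2, seed=0):
    """Split records into (open_test, remainder).

    Assumption: a record is held out if its synthesis model is reserved
    OR its speaker was drawn into the held-out speaker pool.
    """
    rng = random.Random(seed)

    # Group the distinct speakers by (accent, sex).
    groups = defaultdict(set)
    for rec in records:
        groups[(rec["country"], rec["sex"])].add(rec["speaker_id"])

    # Draw the requested fraction of speakers from every group.
    held_out_speakers = set()
    for key in sorted(groups):
        speakers = sorted(groups[key])
        n_held_out = round(speaker_fraction * len(speakers))
        held_out_speakers.update(rng.sample(speakers, n_held_out))

    open_test, remainder = [], []
    for rec in records:
        if rec["model"] in held_out_models or rec["speaker_id"] in held_out_speakers:
            open_test.append(rec)
        else:
            remainder.append(rec)
    return open_test, remainder
```

With this reading, the remainder contains neither the reserved models nor the reserved speakers, which matches the "unseen speakers and synthesis models" property of the open-test subset.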
## Under-sampling
Because HABLA contains more spoof samples than bonafide samples,
the spoof samples in the remaining set were under-sampled to match the number of bonafide samples, yielding a balanced set.
Since each synthesis model has a different number of samples, the models with the fewest samples were kept in full, and the models with more samples were randomly sampled to make up the rest of the bonafide sample count.
This yielded a total of 37482 samples with the following distribution:
```markdown
| Origin                | # Samples | % Set |
|-----------------------|-----------|-------|
| CycleGAN              | 6617      | 17.65 |
| Diff                  | 6617      | 17.65 |
| TTS (Microsoft Azure) | 3934      | 10.5  |
| TTS-StarGAN           | 1573      | 4.2   |
| Bonafide              | 18741     | 50    |
|-----------------------|-----------|-------|
| Total samples         | 37482     |       |
```
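One way to arrive at per-model quotas like those in the table is a "fill the smallest first" allocation, sketched below. The function `undersample_quotas` is a hypothetical helper, and the pre-sampling counts used for CycleGAN and Diff in the usage example are made up for illustration (the card does not state them); only the TTS and TTS-StarGAN counts and the target of 18741 come from the table.

```python
def undersample_quotas(counts, target):
    """Return how many spoof samples to keep per synthesis model.

    Models are visited from smallest to largest. Each one keeps either
    all of its samples or an equal share of what is still needed,
    whichever is smaller, so small models are kept in full.
    """
    if sum(counts.values()) < target:
        raise ValueError("not enough samples to reach the target")

    quotas = {}
    remaining = target
    ordered = sorted(counts.items(), key=lambda kv: kv[1])
    for i, (model, available) in enumerate(ordered):
        fair_share = remaining // (len(ordered) - i)
        quotas[model] = min(available, fair_share)
        remaining -= quotas[model]
    return quotas


# Hypothetical availability for the two larger models; the two smaller
# counts and the target are taken from the table above.
available = {"CycleGAN": 10000, "Diff": 8000, "TTS": 3934, "TTS-StarGAN": 1573}
quotas = undersample_quotas(available, target=18741)
```

Under these assumptions the two smaller models are kept whole and the two larger models split the remaining 13234 samples evenly, 6617 each. The actual selection within each model would then be a random draw of that many samples.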
The balanced set was split into train, validation and close-test subsets in proportions of 0.8, 0.1 and 0.1 respectively, preserving the balance between bonafide and spoof samples.
```markdown
| Subset        | # Samples | % Set |
|---------------|-----------|-------|
| Train         | 29986     | 80    |
| Validation    | 3748      | 10    |
| Close-test    | 3748      | 10    |
|---------------|-----------|-------|
| Total samples | 37482     |       |
```
The label distribution of the balanced set is:
```markdown
| Label    | # Samples | % Set |
|----------|-----------|-------|
| bonafide | 18741     | 50    |
| spoof    | 18741     | 50    |
```
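A label-stratified split with these proportions can be sketched as below. The helper `stratified_split` is hypothetical and uses only the standard library; the rounding rule (round the train and validation sizes per label, give the rest to the test subset) is an assumption that happens to reproduce the subset sizes in the table.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, train_frac=0.8, val_frac=0.1, seed=0):
    """Split items into (train, val, test), keeping label proportions."""
    rng = random.Random(seed)

    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(train_frac * len(group))
        n_val = round(val_frac * len(group))
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # whatever is left over
    return train, val, test


# Synthetic stand-in for the balanced set: 18741 items per label.
items = [("bonafide", i) for i in range(18741)] + [("spoof", i) for i in range(18741)]
train, val, test = stratified_split(items, label_of=lambda item: item[0])
```

Each label contributes 14993 train, 1874 validation and 1874 test items, which gives the 29986 / 3748 / 3748 sizes reported above.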
## Data augmentation
The [RawBoost](https://arxiv.org/pdf/2111.04433) data augmentation technique was applied to the train subset using the following algorithms:
* (4) Series Convolutive-Impulsive-Stationary noise
* (5) Series Convolutive-Impulsive noise
* (6) Series Convolutive-Stationary noise
* (7) Series Impulsive-Stationary noise
This increased the size of the train subset by a factor of 5, for a total of 149930 samples.
```markdown
| Source        | # Samples | % Set |
|---------------|-----------|-------|
| Original      | 29986     | 20    |
| Algorithm-4   | 29986     | 20    |
| Algorithm-5   | 29986     | 20    |
| Algorithm-6   | 29986     | 20    |
| Algorithm-7   | 29986     | 20    |
|---------------|-----------|-------|
| Total samples | 149930    |       |
```
The default RawBoost parameters provided in the paper were used.
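The bookkeeping behind the factor of 5 can be sketched as follows. This does not implement RawBoost itself: `augment` is a caller-supplied function standing in for whatever RawBoost implementation is used, and `expand_with_augmentations` is a hypothetical helper. Each original record is kept with `augmentation_algorithm` set to 0, and one augmented copy is added per algorithm.

```python
RAWBOOST_ALGORITHMS = (4, 5, 6, 7)


def expand_with_augmentations(records, augment, algorithms=RAWBOOST_ALGORITHMS):
    """Return the original records plus one augmented copy per algorithm.

    `augment(audio, algorithm)` is a placeholder for a real RawBoost call.
    """
    expanded = []
    for record in records:
        # The untouched original is encoded as algorithm 0.
        expanded.append({**record, "augmentation_algorithm": 0})
        for algorithm in algorithms:
            expanded.append({
                **record,
                "audio": augment(record["audio"], algorithm),
                "augmentation_algorithm": algorithm,
            })
    return expanded
```

Applied to the 29986 train samples, this produces 29986 × 5 = 149930 records, 29986 for each value of `augmentation_algorithm`.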
## Metadata
All subsets contain metadata to facilitate filtering and experimentation. The fields are the following:
* speaker_id: unique speaker ID
  ex: 'arf_06136'
* country: speaker's accent
  values: {Argentina, Chile, Colombia, Peru, Venezuela}
* sex: speaker's sex
  values: {Female, Male}
* file_name: actual audio file name
  ex: 'arf_00610_00006739039.wav'
* augmentation_algorithm: integer encoding the RawBoost algorithm used to augment the audio
  values: {0, 4, 5, 6, 7}
  notes: the encoding is
  0 -> No augmentation (original)
  4 -> Series Convolutive-Impulsive-Stationary noise
  5 -> Series Convolutive-Impulsive noise
  6 -> Series Convolutive-Stationary noise
  7 -> Series Impulsive-Stationary noise
* model: synthesis model used to generate the audio
  values: {'CycleGAN', 'Diff', 'StarGAN', 'TTS', 'TTS-Diff', 'TTS-StarGAN', '-'}
  notes: '-' is a placeholder used for bonafide samples
* label: category to which the audio belongs
  values: {spoof, bonafide}
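As an illustration of filtering on these fields, the sketch below streams a split with the Hugging Face `datasets` library and keeps only matching examples. The helper names `matches` and `stream_filtered` are hypothetical, and `datasets` is assumed to be installed; streaming is used here only to avoid downloading the full dataset up front.

```python
AUGMENTATION_NAMES = {
    0: "No augmentation (original)",
    4: "Series Convolutive-Impulsive-Stationary noise",
    5: "Series Convolutive-Impulsive noise",
    6: "Series Convolutive-Stationary noise",
    7: "Series Impulsive-Stationary noise",
}


def matches(example, country=None, label=None, algorithm=None):
    """Return True if the example satisfies every criterion that was given."""
    if country is not None and example["country"] != country:
        return False
    if label is not None and example["label"] != label:
        return False
    if algorithm is not None and example["augmentation_algorithm"] != algorithm:
        return False
    return True


def stream_filtered(split="train", **criteria):
    """Stream one split and lazily keep the examples matching the criteria."""
    from datasets import load_dataset  # third-party: pip install datasets

    dataset = load_dataset("Dax99993/habla-augmented", split=split, streaming=True)
    return dataset.filter(lambda example: matches(example, **criteria))
```

For example, `stream_filtered("train", country="Chile", algorithm=0)` would iterate over the non-augmented training samples with a Chilean accent.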
Provider:
Dax99993



