Tonic/WaxalNLP
Source: Hugging Face · Updated 2026-04-16 · Indexed 2026-04-26
Download link: https://hf-mirror.com/datasets/Tonic/WaxalNLP
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: speaker_id
    dtype: string
  - name: locale
    dtype: string
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: gender
    dtype: string
  splits:
  - name: train
    num_bytes: 5289296605.0
    num_examples: 834
  - name: test
    num_bytes: 716327773.0
    num_examples: 111
  - name: validation
    num_bytes: 565064607.0
    num_examples: 97
  download_size: 6146314403
  dataset_size: 6570688985.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
---
# Google WaxalNLP `Wolof` Re-alignment
**Google** introduced [`WAXAL`](https://huggingface.co/datasets/google/WaxalNLP/), a new open dataset covering **21 African languages**, to tackle data scarcity and build inclusive speech technology. However, [the Wolof subset has alignment issues between the audio files and their transcriptions](https://huggingface.co/datasets/google/WaxalNLP/discussions/16), which makes that subset unusable.
We therefore propose to correct this with a simple and effective approach:
1. For each audio clip, we generated a transcription using [Google Gemini ASR](https://ai.google.dev/gemini-api/docs/audio).
2. For each generated transcription, we computed the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to every original transcription.
3. The original transcription with the lowest distance is the one most similar to the ASR output.
4. The index of that original transcription is taken as the correct index for the audio clip and is used to fix the misalignment.
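The matching step can be sketched as follows. This is a minimal illustration, not the authors' notebook: the function names `levenshtein` and `realign` are hypothetical, the strings are toy placeholders rather than dataset rows, and no text normalization is applied (the original pipeline may differ on all of these points).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def realign(asr_texts, original_texts):
    """For each ASR transcript, return the index of the closest original transcript."""
    matches = []
    for hypothesis in asr_texts:
        distances = [levenshtein(hypothesis, ref) for ref in original_texts]
        # Index of the smallest distance (first one wins on ties).
        matches.append(min(range(len(distances)), key=distances.__getitem__))
    return matches


# Toy example (hypothetical strings): the ASR outputs are listed in a
# different order than the original transcriptions.
original = ["nanga def", "jerejef", "ba beneen yoon"]
asr = ["ba benen yon", "nanga deff", "jerejeff"]
print(realign(asr, original))  # [2, 0, 1]
```

Comparing every ASR output against every original transcription is quadratic in the number of samples, which stays manageable at the size of this subset.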
During the process we also identified corrupted files that could not be read. As part of this filtering:
- `171` samples were removed from the `train` split
- `20` samples were removed from the `test` split
- `22` samples were removed from the `validation` split
This leaves the final dataset with the following statistics:
```
DatasetDict({
    train: Dataset({
        features: ['id', 'speaker_id', 'locale', 'gender', 'audio', 'text'],
        num_rows: 834
    })
    test: Dataset({
        features: ['id', 'speaker_id', 'locale', 'gender', 'audio', 'text'],
        num_rows: 111
    })
    validation: Dataset({
        features: ['id', 'speaker_id', 'locale', 'gender', 'audio', 'text'],
        num_rows: 97
    })
})
```
> NOTE: Some audio files show a duration of `00:00/00:00` in the Hugging Face player but play properly once loaded in your script.
## Dataset duration
Grouping by split:
| Split | Duration | Total (seconds) | Number of samples |
| :--- | :--- | :--- | :--- |
| **Train** | 411 min 10 s | 24 670 s | 834 |
| **Test** | 52 min 18 s | 3 138 s | 111 |
| **Validation** | 39 min 46 s | 2 386 s | 97 |
| --- | --- | --- | --- |
| **Total** | **503 min 15 s** | **30 195 s** | **1042** |
Grouping by speaker id:
| Split | Speaker ID | Duration (h, min, s) | Number of samples | Gender |
| :--- | :--- | :--- | :--- | :--- |
| **Train** | 1 | 1 h 28 min 46 s | 150 | male |
| | 8 | 1 h 09 min 33 s | 129 | female |
| | 5 | 1 h 07 min 27 s | 128 | female |
| | 3 | 1 h 05 min 41 s | 171 | female |
| | 2 | 1 h 00 min 49 s | 128 | female |
| | 4 | 0 h 58 min 55 s | 128 | male |
| --- | --- | --- | --- | --- |
| **Test** | 2 | 0 h 15 min 12 s | 20 | female |
| | 3 | 0 h 11 min 20 s | 20 | female |
| | 5 | 0 h 10 min 02 s | 23 | female |
| | 1 | 0 h 06 min 00 s | 19 | male |
| | 4 | 0 h 05 min 52 s | 14 | male |
| | 8 | 0 h 03 min 52 s | 15 | female |
| --- | --- | --- | --- | --- |
| **Validation** | 8 | 0 h 07 min 57 s | 18 | female |
| | 2 | 0 h 07 min 10 s | 18 | female |
| | 4 | 0 h 07 min 08 s | 15 | male |
| | 1 | 0 h 06 min 40 s | 17 | male |
| | 3 | 0 h 05 min 52 s | 16 | female |
| | 5 | 0 h 05 min 00 s | 13 | female |
---
> The speakers' genders were missing from the initial dataset and were marked as `unknown`. To correct this, we grouped the audio files by `speaker_id`, then listened to samples from each speaker to determine their gender manually. We ended up identifying `06` speakers: `02` male and `04` female.
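A per-speaker summary like the one in the table above could be recomputed along these lines. This is a sketch under assumptions: `duration_s` is a hypothetical per-sample field (for example, number of audio samples divided by the sampling rate), speaker ids are assumed to be stored as strings such as `"1"`, and the rows below are toy values, not dataset entries. The gender mapping is copied from the table.

```python
from collections import defaultdict

# Gender per speaker_id, as listed in the per-speaker table.
SPEAKER_GENDER = {
    "1": "male", "2": "female", "3": "female",
    "4": "male", "5": "female", "8": "female",
}


def format_hms(seconds: float) -> str:
    """Format a duration in seconds as 'H h MM min SS s'."""
    total = int(round(seconds))
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours} h {minutes:02d} min {secs:02d} s"


def summarize_by_speaker(rows):
    """Sum durations and count samples per speaker, then attach the gender label."""
    totals = defaultdict(lambda: {"seconds": 0.0, "samples": 0})
    for row in rows:
        entry = totals[row["speaker_id"]]
        entry["seconds"] += row["duration_s"]  # hypothetical field, see lead-in
        entry["samples"] += 1
    return {
        speaker: {
            "duration": format_hms(entry["seconds"]),
            "samples": entry["samples"],
            "gender": SPEAKER_GENDER.get(speaker, "unknown"),
        }
        for speaker, entry in totals.items()
    }


# Toy rows (hypothetical values).
rows = [
    {"speaker_id": "1", "duration_s": 35.2},
    {"speaker_id": "8", "duration_s": 30.0},
    {"speaker_id": "1", "duration_s": 40.0},
]
for speaker, stats in summarize_by_speaker(rows).items():
    print(speaker, stats)
```

Durations are rounded to whole seconds only when formatted, so per-speaker totals may differ from per-split totals by a second.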
## Load the dataset
You can download the dataset with the following script:
```python
from huggingface_hub import snapshot_download

# Download only the Parquet shards into ./waxal_wol
snapshot_download(
    repo_id="galsenai/WaxalNLP",
    repo_type="dataset",
    allow_patterns="data/*.parquet",
    local_dir="./waxal_wol",
)
```
Then load the downloaded Parquet files:
```python
from datasets import load_dataset

# Map each split to its local Parquet shards
dataset = load_dataset(
    "parquet",
    data_files={
        "train": "waxal_wol/data/train-*.parquet",
        "test": "waxal_wol/data/test-*.parquet",
        "validation": "waxal_wol/data/validation-*.parquet",
    },
)
print(dataset)
```
The notebook used to make these corrections is available on [Google Colab](https://drive.google.com/file/d/1PIZ1aRxjaGQ5TAk4rUBo-ItZylfzjGkL/view?usp=sharing) to help you fix similar issues in your language, pending the upcoming fixes planned by the Waxal project team.
> This work was carried out by [Derguene](https://huggingface.co/derguene), with help from [Abdou Aziz](https://huggingface.co/abdouaziiz) in identifying the misalignment issue.
Provider: Tonic



