Tonic/WaxalNLP

Name: Tonic/WaxalNLP
Creator: Tonic
Published: 2026-04-16 09:27:06
License: No description available

Source: Hugging Face. Updated: 2026-04-16. Indexed: 2026-04-26

Download link:

https://hf-mirror.com/datasets/Tonic/WaxalNLP


Description:

---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: speaker_id
    dtype: string
  - name: locale
    dtype: string
  - name: audio
    dtype: audio
  - name: text
    dtype: string
  - name: gender
    dtype: string
  splits:
  - name: train
    num_bytes: 5289296605.0
    num_examples: 834
  - name: test
    num_bytes: 716327773.0
    num_examples: 111
  - name: validation
    num_bytes: 565064607.0
    num_examples: 97
  download_size: 6146314403
  dataset_size: 6570688985.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
  - split: validation
    path: data/validation-*
---

# Google WaxalNLP `Wolof` Re-alignment

**Google** introduced [`WAXAL`](https://huggingface.co/datasets/google/WaxalNLP/), a new open dataset for **21 African languages**, to tackle data scarcity and build inclusive speech technology. However, [the Wolof language has experienced alignment issues between the audio files and their transcriptions](https://huggingface.co/datasets/google/WaxalNLP/discussions/16), making the Wolof portion of the dataset unusable.

We therefore propose to correct this using a simple and effective approach:

1. For each audio clip, we generated a transcription using [Google Gemini ASR](https://ai.google.dev/gemini-api/docs/audio).
2. For each generated transcription, we calculated the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to every original transcription.
3. The lowest distance identifies the original transcription most similar to the one generated by the ASR.
4. The index of that original transcription is the correct index, which is then used to fix the misalignment.

During the process, we also identified a couple of corrupted files that could not be read.
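
The matching in steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `levenshtein` and `closest_index` are hypothetical, the ASR call from step 1 is not reproduced (its output is assumed to be available as plain strings), and the example sentences are made up. How ties or duplicate transcriptions were handled is not stated in this card, so the sketch simply takes the first minimum.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_index(asr_text: str, originals: Sequence[str]) -> int:
    """Index of the original transcription with the lowest distance to asr_text."""
    return min(range(len(originals)),
               key=lambda k: levenshtein(asr_text, originals[k]))


# Made-up example: original transcriptions stored in the wrong order,
# and imperfect ASR output for three clips (assumed data, for illustration only).
originals = ["nanga def", "jamm rekk", "ba beneen yoon"]
asr_texts = ["jam rek", "ba benen yon", "nanga def"]

matches = [closest_index(text, originals) for text in asr_texts]
print(matches)  # → [1, 2, 0]
```

Comparing every ASR output against every original transcription is quadratic in the number of clips, which stays manageable for a subset of roughly a thousand samples.
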
As part of this filtering process:

- `171` samples were removed from the `train` split
- `20` samples were removed from the `test` split
- `22` samples were removed from the `validation` split

Leaving the final dataset with the following stats:

```
DatasetDict({
    train: Dataset({
        features: ['id', 'speaker_id', 'locale', 'gender', 'audio', 'text'],
        num_rows: 834
    })
    test: Dataset({
        features: ['id', 'speaker_id', 'locale', 'gender', 'audio', 'text'],
        num_rows: 111
    })
    validation: Dataset({
        features: ['id', 'speaker_id', 'locale', 'gender', 'audio', 'text'],
        num_rows: 97
    })
})
```

> NOTE: Some audio files show a duration of `00:00/00:00` in the Hugging Face player but play properly once loaded into your script.

## Dataset duration

Grouping by split:

| Split | Duration | Total (seconds) | Nb of samples |
| :--- | :--- | :--- | :--- |
| **Train** | 411 min 10 s | 24 670 s | 834 |
| **Test** | 52 min 18 s | 3 138 s | 111 |
| **Validation** | 39 min 46 s | 2 386 s | 97 |
| --- | --- | --- | --- |
| **Total** | **503 min 15 s** | **30 195 s** | 1042 |

Grouping by speaker id:

| Split | Speaker ID | Duration (H, M, S) | Nb of samples | Gender |
| :--- | :--- | :--- | :--- | :--- |
| **Train** | 1 | 1 h 28 min 46 s | 150 | male |
| | 8 | 1 h 09 min 33 s | 129 | female |
| | 5 | 1 h 07 min 27 s | 128 | female |
| | 3 | 1 h 05 min 41 s | 171 | female |
| | 2 | 1 h 00 min 49 s | 128 | female |
| | 4 | 0 h 58 min 55 s | 128 | male |
| --- | --- | --- | --- | --- |
| **Test** | 2 | 0 h 15 min 12 s | 20 | female |
| | 3 | 0 h 11 min 20 s | 20 | female |
| | 5 | 0 h 10 min 02 s | 23 | female |
| | 1 | 0 h 06 min 00 s | 19 | male |
| | 4 | 0 h 05 min 52 s | 14 | male |
| | 8 | 0 h 03 min 52 s | 15 | female |
| --- | --- | --- | --- | --- |
| **Validation** | 8 | 0 h 07 min 57 s | 18 | female |
| | 2 | 0 h 07 min 10 s | 18 | female |
| | 4 | 0 h 07 min 08 s | 15 | male |
| | 1 | 0 h 06 min 40 s | 17 | male |
| | 3 | 0 h 05 min 52 s | 16 | female |
| | 5 | 0 h 05 min 00 s | 13 | female |

---
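
The card does not say how the durations in the tables above were computed. One plausible way, sketched here under that caveat, is to take each clip's length as its number of audio samples divided by its sampling rate, then sum per split or per speaker. The helper names `clip_seconds`, `total_by_key` and `format_min_sec` are hypothetical, and the sketch works on plain numbers so that it does not depend on any particular audio-decoding API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def clip_seconds(num_samples: int, sampling_rate: int) -> float:
    """Duration of one clip: number of audio samples over the sampling rate."""
    return num_samples / sampling_rate


def total_by_key(clips: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum clip durations (seconds) per key, e.g. per split or per speaker id."""
    totals: Dict[str, float] = defaultdict(float)
    for key, seconds in clips:
        totals[key] += seconds
    return dict(totals)


def format_min_sec(seconds: float) -> str:
    """Render a duration in the style of the split table, e.g. '52 min 18 s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes} min {secs} s"


# The per-split totals from the table convert back to the listed durations.
print(format_min_sec(24670))  # → 411 min 10 s
print(format_min_sec(3138))   # → 52 min 18 s
```

Summing rounded per-speaker values can differ by a second or so from a total computed on unrounded durations, which may explain small discrepancies between the two tables.
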
> The speakers' genders were missing from the initial dataset and were marked as `unknown`. To correct this, we started by grouping the audio files by `speaker_id`, then listened to samples from each speaker to manually determine their gender. We ended up identifying `06` speakers: `02` males and `04` females.

## Load the dataset

You can download the dataset with the following script:

```python
from huggingface_hub import snapshot_download

# Fetch only the Parquet shards of the dataset repository into ./waxal_wol
snapshot_download(
    repo_id="galsenai/WaxalNLP",
    repo_type="dataset",
    allow_patterns="data/*.parquet",
    local_dir="./waxal_wol",
)
```

And then load the dataset with the following:

```python
from datasets import load_dataset

# Map each split name to its downloaded Parquet shards
dataset = load_dataset(
    "parquet",
    data_files={
        "train": "waxal_wol/data/train-*.parquet",
        "test": "waxal_wol/data/test-*.parquet",
        "validation": "waxal_wol/data/validation-*.parquet",
    },
)
print(dataset)
```

The notebook used to make these corrections is available on [Google Colab](https://drive.google.com/file/d/1PIZ1aRxjaGQ5TAk4rUBo-ItZylfzjGkL/view?usp=sharing) to help you fix similar issues in your language, pending the upcoming fixes planned by the Waxal project team.

> This work was carried out by [Derguene](https://huggingface.co/derguene), with help from [Abdou Aziz](https://huggingface.co/abdouaziiz), who helped identify the misalignment issue.

Provided by:

Tonic
