ik/akan-tts-wavtokenizer-combined

Hugging Face · Updated 2026-03-20 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/ik/akan-tts-wavtokenizer-combined

Resource description:

---
language:
- ak
- tw
license: cc-by-sa-4.0
tags:
- tts
- speech
- akan
- twi
- wavtokenizer
- word-aligned
size_categories:
- 10K<n<100K
---

# Akan TTS: WavTokenizer Word-Aligned Combined Dataset

Combined word-aligned dataset for training Akan/Twi TTS models. Audio encoded with **WavTokenizer** (75Hz, single codebook, codes 0-4095) and word boundaries from **MMS forced alignment**.

## Overview

| | |
|---|---|
| **Total samples** | 96,615 |
| **Total hours** | 222.9h |
| **Sources** | 5 |

## Splits

| Split | Samples | Hours |
|-------|---------|-------|
| train | 95,165 | 219.4h |
| validation | 966 | 2.2h |
| test | 484 | 1.2h |

## Sources

| Source | Samples | Hours | Avg Duration | Median Duration |
|--------|---------|-------|--------------|-----------------|
| akuapem-twi-tts | 25,483 | 60.2h | 8.5s | 7.9s |
| asante-twi-tts | 28,538 | 73.1h | 9.2s | 8.5s |
| twi-multispeaker | 28,048 | 14.1h | 1.8s | 1.7s |
| waxalnlp-aka-asr | 12,752 | 65.8h | 18.6s | 18.0s |
| waxalnlp-twi-tts | 1,794 | 9.6h | 19.3s | 19.4s |

## Duration Statistics

| | |
|---|---|
| **Min** | 0.2s |
| **Max** | 35.0s |
| **Mean** | 8.3s |
| **Median** | 6.9s |
| **Std** | 6.4s |

## Words per Sample

| | |
|---|---|
| **Min** | 1 |
| **Max** | 135 |
| **Mean** | 19.5 |
| **Median** | 18 |

## Schema

| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Original transcription (with diacritics) |
| `words_aligned` | string (JSON) | `[{"word", "duration", "codes"}]`: word-level WavTokenizer codes |
| `source` | string | Dataset identifier |

## Encoding Pipeline

1. Audio resampled to 24kHz mono
2. **WavTokenizer** (`wavtokenizer_large_speech_320_24k`) encodes audio to discrete codes at 75 tokens/sec
3. **MMS forced alignment** (`torchaudio.pipelines.MMS_FA`) aligns text to audio at word level
4. Each word gets: romanized text, duration (seconds), and WavTokenizer code sequence
5. Long audio (>35s) split at sentence boundaries using FA word timings
6. Audio preprocessing: VAD silence trimming, edge click removal (position-aware for chunks)

## Usage

```python
from datasets import load_dataset
import json

ds = load_dataset("ik/akan-tts-wavtokenizer-combined")
sample = ds["train"][0]
words = json.loads(sample["words_aligned"])
# words = [{"word": "wo", "duration": 0.45, "codes": [123, 456, ...]}, ...]
```

## License

CC-BY-SA-4.0 (inherits from source datasets)
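
## Example: flattening `words_aligned`

The schema and the 75 tokens/sec rate described above suggest a simple way to turn one record into a flat code sequence for training. The following is a minimal sketch, not the dataset author's tooling: the helper names `flatten_words` and `codes_to_seconds` and the sample record are hypothetical, and it assumes that concatenating per-word codes in order approximates the full utterance and that each word's `duration` roughly equals its code count divided by 75. The card does not confirm either assumption, so check them against real samples.

```python
import json

FRAME_RATE_HZ = 75   # from the card: 75 tokens/sec
CODEBOOK_MAX = 4095  # from the card: codes 0-4095


def flatten_words(words_aligned_json):
    """Parse a words_aligned JSON string into (texts, codes, total_duration)."""
    words = json.loads(words_aligned_json)
    texts = [w["word"] for w in words]
    # Assumption: word-level code sequences concatenate in utterance order.
    codes = [c for w in words for c in w["codes"]]
    total_duration = sum(w["duration"] for w in words)
    return texts, codes, total_duration


def codes_to_seconds(n_codes, frame_rate=FRAME_RATE_HZ):
    """Duration implied by a code count at the given frame rate."""
    return n_codes / frame_rate


# Hypothetical record, shaped like the schema above (not taken from the dataset).
example = json.dumps([
    {"word": "me", "duration": 0.20, "codes": [12, 873, 4095] * 5},  # 15 codes
    {"word": "da", "duration": 0.40, "codes": [0, 77, 1500] * 10},   # 30 codes
])

texts, codes, total_duration = flatten_words(example)

# Sanity check: every code must fall inside the single codebook's range.
assert all(0 <= c <= CODEBOOK_MAX for c in codes)

print(texts, len(codes), codes_to_seconds(len(codes)), total_duration)
```

On real data, a large gap between `codes_to_seconds(len(codes))` and the summed word durations would indicate that one of the assumptions above does not hold for that sample.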
