ik/akan-tts-wavtokenizer-combined
Hugging Face · Updated 2026-03-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/ik/akan-tts-wavtokenizer-combined
Description:
---
language:
- ak
- tw
license: cc-by-sa-4.0
tags:
- tts
- speech
- akan
- twi
- wavtokenizer
- word-aligned
size_categories:
- 10K<n<100K
---
# Akan TTS — WavTokenizer Word-Aligned Combined Dataset
A combined, word-aligned dataset for training Akan/Twi TTS models. Audio is encoded with **WavTokenizer** (75 tokens/sec, single codebook, code values 0-4095), and word boundaries come from **MMS forced alignment**.
## Overview
| Metric | Value |
|---|---|
| **Total samples** | 96,615 |
| **Total hours** | 222.9h |
| **Sources** | 5 |
## Splits
| Split | Samples | Hours |
|-------|---------|-------|
| train | 95,165 | 219.4h |
| validation | 966 | 2.2h |
| test | 484 | 1.2h |
## Sources
| Source | Samples | Hours | Avg Duration | Median Duration |
|--------|---------|-------|--------------|-----------------|
| akuapem-twi-tts | 25,483 | 60.2h | 8.5s | 7.9s |
| asante-twi-tts | 28,538 | 73.1h | 9.2s | 8.5s |
| twi-multispeaker | 28,048 | 14.1h | 1.8s | 1.7s |
| waxalnlp-aka-asr | 12,752 | 65.8h | 18.6s | 18.0s |
| waxalnlp-twi-tts | 1,794 | 9.6h | 19.3s | 19.4s |
## Duration Statistics
| Statistic | Duration |
|---|---|
| **Min** | 0.2s |
| **Max** | 35.0s |
| **Mean** | 8.3s |
| **Median** | 6.9s |
| **Std** | 6.4s |
## Words per Sample
| Statistic | Words |
|---|---|
| **Min** | 1 |
| **Max** | 135 |
| **Mean** | 19.5 |
| **Median** | 18 |
## Schema
| Column | Type | Description |
|--------|------|-------------|
| `text` | string | Original transcription (with diacritics) |
| `words_aligned` | string (JSON) | JSON-encoded list of objects with keys `word` (romanized text), `duration` (seconds), and `codes` (WavTokenizer code sequence for that word) |
| `source` | string | Dataset identifier |
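Because `words_aligned` is stored as a JSON string, it may help to decode and sanity-check it before training. The following is a minimal sketch, not part of the dataset tooling: `parse_words_aligned` is a hypothetical helper, the example string is made up, and the range check simply applies the 0-4095 code range stated above.

```python
import json

CODEBOOK_SIZE = 4096  # codes 0-4095, per the dataset description
REQUIRED_KEYS = {"word", "duration", "codes"}


def parse_words_aligned(raw: str) -> list[dict]:
    """Decode a `words_aligned` JSON string and check each entry's shape.

    Hypothetical helper for illustration; not shipped with the dataset.
    """
    words = json.loads(raw)
    for entry in words:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"entry is missing keys: {sorted(missing)}")
        if not all(0 <= code < CODEBOOK_SIZE for code in entry["codes"]):
            raise ValueError(f"code out of range in word {entry['word']!r}")
    return words


# Made-up example value, not taken from the dataset.
example = '[{"word": "wo", "duration": 0.04, "codes": [123, 456, 789]}]'
words = parse_words_aligned(example)
print(words[0]["word"], len(words[0]["codes"]))
```

Failing early on a malformed row is usually cheaper than discovering an out-of-vocabulary token id in the middle of a training run.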
## Encoding Pipeline
1. Audio is resampled to 24 kHz mono
2. **WavTokenizer** (`wavtokenizer_large_speech_320_24k`) encodes the audio into discrete codes at 75 tokens/sec
3. **MMS forced alignment** (`torchaudio.pipelines.MMS_FA`) aligns the text to the audio at word level
4. Each word is stored with its romanized text, duration (seconds), and WavTokenizer code sequence
5. Audio longer than 35 s is split at sentence boundaries using the forced-alignment word timings
6. Audio preprocessing consists of VAD silence trimming and edge click removal (position-aware for chunks)
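The numbers in the steps above fit together arithmetically: 24,000 samples/sec at 75 tokens/sec gives 320 samples per token, which is consistent with the `320` in the checkpoint name. The sketch below works through that arithmetic and shows one plausible way word timings could be mapped to code slices. The rounding rule in `slice_codes`, the timings, and the code values are all assumptions for illustration; the card does not specify how the dataset authors performed this step.

```python
SAMPLE_RATE = 24_000   # Hz, from step 1
TOKENS_PER_SEC = 75    # from step 2
MAX_CHUNK_SEC = 35     # split threshold from step 5

# Samples of audio covered by one token, and the approximate token
# budget of a chunk at the 35 s split threshold.
SAMPLES_PER_TOKEN = SAMPLE_RATE // TOKENS_PER_SEC
MAX_TOKENS_PER_CHUNK = MAX_CHUNK_SEC * TOKENS_PER_SEC


def slice_codes(codes, start_s, end_s):
    """Return the codes between two timestamps.

    Assumption: boundaries are rounded to the nearest token frame.
    """
    start = round(start_s * TOKENS_PER_SEC)
    end = round(end_s * TOKENS_PER_SEC)
    return codes[start:end]


# Hypothetical 2-second utterance and word timings, for illustration only.
utterance_codes = list(range(150))
timings = [("wo", 0.0, 0.4), ("ho", 0.4, 1.0)]
words = [
    {"word": w, "duration": end - start,
     "codes": slice_codes(utterance_codes, start, end)}
    for w, start, end in timings
]

print(SAMPLES_PER_TOKEN, MAX_TOKENS_PER_CHUNK)
print([(w["word"], len(w["codes"])) for w in words])
```

Under this assumed rounding, a word's code count is roughly its duration times 75, so sequence lengths can be estimated from durations alone.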
## Usage
```python
import json

from datasets import load_dataset

# Load all splits (train / validation / test)
ds = load_dataset("ik/akan-tts-wavtokenizer-combined")
sample = ds["train"][0]

# `words_aligned` is stored as a JSON string, so decode it first
words = json.loads(sample["words_aligned"])
# words = [{"word": "wo", "duration": 0.45, "codes": [123, 456, ...]}, ...]
```
```
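For sequence-model training it is often convenient to have one utterance-level code sequence rather than per-word lists. The sketch below concatenates the per-word codes and estimates the covered duration from the 75 tokens/sec frame rate. `flatten_codes` is a hypothetical helper and the row is made up so the snippet runs offline; whether the concatenated word codes cover the entire utterance, including any pauses between words, is an assumption the card does not confirm.

```python
import json

TOKENS_PER_SEC = 75  # WavTokenizer frame rate stated in the card


def flatten_codes(row: dict) -> list[int]:
    """Concatenate per-word codes into a single code sequence."""
    words = json.loads(row["words_aligned"])
    return [code for word in words for code in word["codes"]]


# Hypothetical row with the three documented columns; values are made up.
row = {
    "text": "wo ho",
    "words_aligned": json.dumps([
        {"word": "wo", "duration": 0.04, "codes": [123, 456, 789]},
        {"word": "ho", "duration": 0.04, "codes": [12, 34, 56]},
    ]),
    "source": "asante-twi-tts",
}

codes = flatten_codes(row)
approx_seconds = len(codes) / TOKENS_PER_SEC
print(len(codes), round(approx_seconds, 3))
```

With the real dataset, the same function can be applied to rows of `ds["train"]`, and a single source can be selected first with something like `ds["train"].filter(lambda r: r["source"] == "asante-twi-tts")`.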
## License
CC-BY-SA-4.0 (inherited from the source datasets)
Provider:
ik



