TTS-AGI/voice-taxonomy-flash-train
Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/voice-taxonomy-flash-train
Dataset description:
---
license: cc-by-4.0
task_categories:
- audio-classification
tags:
- voice
- speech
- taxonomy
- whisper
- gemini
- tts
- voice-attributes
size_categories:
- 10K<n<100K
---
# Voice Taxonomy Fine-tuning Dataset (Gemini Flash)
**36,641 speech samples** annotated with **57 voice taxonomy dimensions** (0-6 ordinal scale) by **Gemini Flash**. The set is balanced to at most about 100 samples per bucket per dimension and is intended for fine-tuning voice attribute classifiers.
## Related Datasets
| Dataset | Purpose | Link |
|---------|---------|------|
| Pre-training (large, Whisper ensemble) | Pre-training | [TTS-AGI/voice-taxonomy-pretrain](https://huggingface.co/datasets/TTS-AGI/voice-taxonomy-pretrain) |
| **This dataset** (balanced, high-quality) | Fine-tuning | N/A |
| Validation (Gemini 3.1 Pro gold) | Evaluation | [TTS-AGI/voice-taxonomy-val](https://huggingface.co/datasets/TTS-AGI/voice-taxonomy-val) |
## Format
WebDataset TAR with MP3+JSON pairs:
```
{stem}.mp3 # Audio (mono, 44.1kHz, 64kbps, ≤30s)
{stem}.json # 57-dim taxonomy annotation
```
Each JSON:
```json
{
  "AGEV": {"value": 3, "name": "Perceived Age", "label": "young adult"},
  "GEND": {"value": 5, "name": "Gender Presentation", "label": "standard masculine"},
  "TEMP": {"value": 4, "name": "Tempo", "label": "slightly fast energetic"},
  ...
}
```
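The MP3+JSON layout above can be read with the Python standard library alone. The following is a minimal sketch, not the loader used by `train_voice_taxonomy.py`: the helper names `iter_samples` and `to_label_vector` are hypothetical, and it assumes each pair shares the same `{stem}` and that every dimension entry carries an integer `value` field as in the example annotation.

```python
import json
import tarfile
from collections import defaultdict


def iter_samples(tar_path):
    """Yield (stem, mp3_bytes, annotation_dict) for each complete MP3+JSON pair."""
    pending = defaultdict(dict)  # stem -> {"mp3": bytes, "json": bytes}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            stem, dot, ext = member.name.rpartition(".")
            if not dot or ext not in ("mp3", "json"):
                continue  # ignore anything that is not part of a pair
            pending[stem][ext] = tar.extractfile(member).read()
            if len(pending[stem]) == 2:
                pair = pending.pop(stem)
                yield stem, pair["mp3"], json.loads(pair["json"])


def to_label_vector(annotation, codes):
    """Return the 0-6 ordinal values for the given dimension codes, in order."""
    return [annotation[code]["value"] for code in codes]
```

For instance, `to_label_vector(annotation, ["AGEV", "GEND", "TEMP"])` applied to the example annotation returns the three ordinal values in that order. Passing an explicit list of codes keeps the vector order fixed across files, which matters when the values feed a multi-head classifier.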
## Training Plan
See [TRAINING_PLAN.md](TRAINING_PLAN.md) for the full training strategy and `train_voice_taxonomy.py` for a self-contained training script.
## Quick Start
```bash
# Download all 3 datasets (--repo-type dataset is required for dataset repos)
huggingface-cli download TTS-AGI/voice-taxonomy-pretrain --repo-type dataset --local-dir pretrain
huggingface-cli download TTS-AGI/voice-taxonomy-flash-train --repo-type dataset --local-dir finetune
huggingface-cli download TTS-AGI/voice-taxonomy-val --repo-type dataset --local-dir val

# Fine-tune (after pre-training)
python train_voice_taxonomy.py --phase finetune --encoder laion/BUD-E-Whisper --gpu 0 \
  --resume checkpoints/pretrain_best.pt \
  --finetune-tar finetune/voice_taxonomy_flash_train.tar \
  --val-tar val/voice_taxonomy_val.tar
```
## Balancing Strategy
Samples were selected to maximize coverage across all 57 × 7 = 399 buckets:
- Up to 100 samples per bucket per dimension
- Files deduplicated across dimensions
- Validation set files excluded
- Total: 36,641 unique files from 318K candidates
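
The selection rules listed above can be illustrated with a short sketch. This is one plausible reading of those rules, not the authors' actual selection code: the function name `select_balanced`, the random sampling within each bucket, and the `seed` parameter are all assumptions.

```python
import random
from collections import defaultdict


def select_balanced(labels, excluded=frozenset(), cap=100, seed=0):
    """Pick up to `cap` files per (dimension, value) bucket and merge the picks.

    labels: dict mapping file id -> {dimension code: ordinal value 0-6}.
    excluded: file ids that must never be selected (e.g. validation files).
    Returns a set of unique file ids (deduplicated across dimensions).
    """
    rng = random.Random(seed)

    # Group candidate files by bucket, skipping excluded files.
    buckets = defaultdict(list)
    for file_id in sorted(labels):
        if file_id in excluded:
            continue
        for dim, value in labels[file_id].items():
            buckets[(dim, value)].append(file_id)

    # Sample each bucket independently; the set union removes duplicates.
    selected = set()
    for key in sorted(buckets):
        pool = buckets[key]
        picked = pool if len(pool) <= cap else rng.sample(pool, cap)
        selected.update(picked)
    return selected
```

Under this reading, 399 buckets with a cap of 100 yield at most 39,900 unique files, which is consistent with the reported total of 36,641 once overlaps and sparsely populated buckets are taken into account. Because every file carries a value on all 57 dimensions, a file picked for one bucket also counts toward 56 others, so actual per-bucket counts in the merged set can exceed the cap.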
## Labels
Labels were generated by **Gemini 2.0 Flash** via multimodal audio annotation with a detailed system prompt covering all 57 dimensions. The prompt includes anti-center-bias instructions, intended to spread ratings across the full 0-6 scale rather than clustering them near the midpoint.
## Taxonomy
57 dimensions covering: speaker identity, timbral quality, resonance placement, prosody, articulation, emotion, and speaking style. See `taxonomy_labels.json` for full definitions.
Provider:
TTS-AGI



