TTS-AGI/voice-taxonomy-val
Hugging Face. Updated 2026-04-08, indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/TTS-AGI/voice-taxonomy-val
Resource description:
---
license: cc-by-4.0
task_categories:
- audio-classification
tags:
- voice
- speech
- taxonomy
- whisper
- gemini
- tts
- voice-attributes
- evaluation
size_categories:
- 1K<n<10K
---
# Voice Taxonomy Validation Dataset (Gemini 3.1 Pro)
**~1,072 speech samples** annotated with **57 voice taxonomy dimensions** (0-6 ordinal scale) by **Gemini 3.1 Pro**. This is the gold-standard evaluation set for voice attribute classifiers.
## Related Datasets
| Dataset | Purpose | Link |
|---------|---------|------|
| Pre-training (large, Whisper ensemble) | Pre-training | [TTS-AGI/voice-taxonomy-pretrain](https://huggingface.co/datasets/TTS-AGI/voice-taxonomy-pretrain) |
| Fine-tuning (balanced, Gemini Flash) | Fine-tuning | [TTS-AGI/voice-taxonomy-flash-train](https://huggingface.co/datasets/TTS-AGI/voice-taxonomy-flash-train) |
| **This dataset** | Evaluation (gold standard) | — |
## Format
WebDataset TAR with MP3+JSON pairs:
```
{stem}.mp3 # Audio (mono, 44.1kHz, 64kbps, ≤30s)
{stem}.json # 57-dim taxonomy annotation
```
Each JSON:
```json
{
"AGEV": {"value": 3, "name": "Perceived Age", "label": "young adult"},
"GEND": {"value": 5, "name": "Gender Presentation", "label": "standard masculine"},
...
}
```
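The MP3+JSON pairing described above can be read with the Python standard library alone. The following is a minimal sketch, not the loader used by `train_voice_taxonomy.py`: the helper names `iter_samples` and `dimension_values` are hypothetical, and it assumes each sample's two files share the same path up to the final extension.

```python
import json
import tarfile
from pathlib import PurePosixPath


def iter_samples(src):
    """Yield (stem, mp3_bytes, annotation) for each MP3+JSON pair in a TAR.

    `src` is a path or a binary file object. Hypothetical helper: it assumes
    both files of a sample share the same stem, as in {stem}.mp3 / {stem}.json.
    """
    tar = tarfile.open(fileobj=src) if hasattr(src, "read") else tarfile.open(src)
    pending = {}  # stem -> partially collected sample
    with tar:
        for member in tar:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            ext = path.suffix.lower().lstrip(".")
            if ext not in ("mp3", "json"):
                continue
            stem = str(path.with_suffix(""))
            data = tar.extractfile(member).read()
            entry = pending.setdefault(stem, {})
            entry[ext] = json.loads(data) if ext == "json" else data
            if "mp3" in entry and "json" in entry:
                del pending[stem]
                yield stem, entry["mp3"], entry["json"]


def dimension_values(annotation):
    """Map each dimension code (e.g. "AGEV") to its integer 0-6 value."""
    return {code: int(field["value"]) for code, field in annotation.items()}
```

For example, `for stem, audio, ann in iter_samples("val/voice_taxonomy_val.tar")` would walk the archive once, keeping only unpaired files in memory until their counterpart arrives.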
## Evaluation
```bash
# Download
huggingface-cli download TTS-AGI/voice-taxonomy-val --repo-type dataset --local-dir val
# Evaluate a trained model
python train_voice_taxonomy.py --phase eval --encoder laion/BUD-E-Whisper --gpu 0 \
--resume checkpoints/finetune_best.pt \
--val-tar val/voice_taxonomy_val.tar
```
## Metrics
| Metric | Description |
|--------|-------------|
| **Exact accuracy** | Prediction == ground truth |
| **Adj1 (primary)** | Prediction within ±1 of ground truth |
| **Mean difference** | Mean absolute difference between prediction and ground truth |
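
The three metrics can be computed from paired integer predictions and labels. This is a minimal sketch with a hypothetical function name, `ordinal_metrics`. The card does not say how scores are aggregated across the 57 dimensions, so pooling all (sample, dimension) pairs into one flat list is an assumption here.

```python
def ordinal_metrics(predictions, truths):
    """Return exact accuracy, Adj1 accuracy, and mean absolute difference.

    Both arguments are equal-length sequences of integer ordinal values.
    Assumption: all (sample, dimension) pairs are pooled into flat lists.
    """
    if len(predictions) != len(truths) or len(truths) == 0:
        raise ValueError("predictions and truths must be non-empty and equal in length")
    diffs = [abs(p - t) for p, t in zip(predictions, truths)]
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,   # prediction == ground truth
        "adj1": sum(d <= 1 for d in diffs) / n,    # within ±1 of ground truth
        "mean_diff": sum(diffs) / n,               # average |prediction - truth|
    }
```

Every exact match also counts toward Adj1, so Adj1 is never lower than exact accuracy.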
### Baseline Results
| Model | Exact | Adj1 | Mean diff |
|-------|-------|------|------|
| V1.0 frozen + MLP | 0.235 | 0.633 | 1.40 |
| V1.1 frozen + MLP | 0.260 | 0.635 | 1.35 |
| V1.0 full finetune | 0.282 | 0.648 | — |
| Random baseline | 0.143 | 0.367 | 1.95 |
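
The card does not state how the random baseline was produced. One plausible reading is uniform guessing over the seven levels 0-6, which gives an exact accuracy of 1/7 ≈ 0.143 whatever the label distribution, matching the table. Adj1 and mean difference under that reading depend on how the ground-truth labels are distributed, which the card does not report. The sketch below, using the hypothetical helper `uniform_guess_expectation`, computes the expected values for any label histogram you supply.

```python
from fractions import Fraction

LEVELS = range(7)  # ordinal values 0..6


def uniform_guess_expectation(truth_counts):
    """Expected (exact, adj1, mean_diff) for a uniform random guesser.

    truth_counts[t] is the number of ground-truth labels equal to t.
    Assumption: guesses are drawn uniformly and independently from 0..6.
    """
    total = sum(truth_counts)
    exact = adj1 = diff = Fraction(0)
    for t, count in enumerate(truth_counts):
        for p in LEVELS:
            prob = Fraction(count, total) * Fraction(1, len(LEVELS))
            d = abs(p - t)
            exact += prob * (d == 0)
            adj1 += prob * (d <= 1)
            diff += prob * d
    return exact, adj1, diff
```

With perfectly uniform labels this yields Adj1 = 19/49 ≈ 0.388 and a mean difference of 16/7 ≈ 2.29, which differ from the reported 0.367 and 1.95, so the reported row presumably reflects the actual label distribution or a different sampling procedure.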
## Training Plan
See [TRAINING_PLAN.md](TRAINING_PLAN.md) for the full training strategy and `train_voice_taxonomy.py` for a self-contained training script.
## Labels
Labels were generated by **Gemini 3.1 Pro**, the most capable model in the annotation pipeline, and serve as the gold standard for evaluation.
## Taxonomy
The 57 dimensions cover speaker identity, timbral quality, resonance placement, prosody, articulation, emotion, and speaking style. See `taxonomy_labels.json` for full definitions.
Provider:
TTS-AGI



