TTS-AGI/voice-taxonomy-val

Name: TTS-AGI/voice-taxonomy-val
Creator: TTS-AGI
Published: 2026-04-08 07:09:29
License: cc-by-4.0

Source: Hugging Face. Updated 2026-04-08; indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/TTS-AGI/voice-taxonomy-val


Description:

---
license: cc-by-4.0
task_categories:
- audio-classification
tags:
- voice
- speech
- taxonomy
- whisper
- gemini
- tts
- voice-attributes
- evaluation
size_categories:
- 1K<n<10K
---

# Voice Taxonomy Validation Dataset (Gemini 3.1 Pro)

**~1,072 speech samples** annotated with **57 voice taxonomy dimensions** (0-6 ordinal scale) by **Gemini 3.1 Pro**. This is the gold-standard evaluation set for voice attribute classifiers.

## Related Datasets

| Dataset | Purpose | Link |
|---------|---------|------|
| Pre-training (large, Whisper ensemble) | Pre-training | [TTS-AGI/voice-taxonomy-pretrain](https://huggingface.co/datasets/TTS-AGI/voice-taxonomy-pretrain) |
| Fine-tuning (balanced, Gemini Flash) | Fine-tuning | [TTS-AGI/voice-taxonomy-flash-train](https://huggingface.co/datasets/TTS-AGI/voice-taxonomy-flash-train) |
| **This dataset** | Evaluation (gold standard) | - |

## Format

WebDataset TAR with MP3+JSON pairs:

```
{stem}.mp3   # Audio (mono, 44.1kHz, 64kbps, ≤30s)
{stem}.json  # 57-dim taxonomy annotation
```

Each JSON:

```json
{
  "AGEV": {"value": 3, "name": "Perceived Age", "label": "young adult"},
  "GEND": {"value": 5, "name": "Gender Presentation", "label": "standard masculine"},
  ...
}
```

## Evaluation

```bash
# Download (--repo-type dataset is required because this is a dataset repo, not a model repo)
huggingface-cli download TTS-AGI/voice-taxonomy-val --repo-type dataset --local-dir val

# Evaluate a trained model
python train_voice_taxonomy.py --phase eval --encoder laion/BUD-E-Whisper --gpu 0 \
    --resume checkpoints/finetune_best.pt \
    --val-tar val/voice_taxonomy_val.tar
```

## Metrics

| Metric | Description |
|--------|-------------|
| **Exact accuracy** | Prediction == ground truth |
| **Adj1 (primary)** | Prediction within ±1 of ground truth |
| **Mean difference** | Average of \|prediction - truth\| |

### Baseline Results

| Model | Exact | Adj1 | Diff |
|-------|-------|------|------|
| V1.0 frozen + MLP | 0.235 | 0.633 | 1.40 |
| V1.1 frozen + MLP | 0.260 | 0.635 | 1.35 |
| V1.0 full finetune | 0.282 | 0.648 | - |
| Random baseline | 0.143 | 0.367 | 1.95 |

## Training Plan

See [TRAINING_PLAN.md](TRAINING_PLAN.md) for the full training strategy and `train_voice_taxonomy.py` for a self-contained training script.

## Labels

Labels were generated by **Gemini 3.1 Pro**, the most capable model in the annotation pipeline. These serve as the gold standard for evaluation.

## Taxonomy

57 dimensions covering: speaker identity, timbral quality, resonance placement, prosody, articulation, emotion, and speaking style. See `taxonomy_labels.json` for full definitions.
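
The TAR layout and the three metrics described above can be checked without the training script. What follows is a minimal sketch, not the project's own evaluation code: `load_annotations` and `score` are hypothetical helper names, and pooling every (sample, dimension) pair into one average is an assumption, since the card does not say whether `train_voice_taxonomy.py` averages per dimension first. It also assumes every `value` field is an integer on the 0-6 scale.

```python
import json
import tarfile


def load_annotations(tar_path):
    """Return {stem: {dimension_code: value}} from every .json member of the TAR."""
    annotations = {}
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            # Skip the .mp3 members; only the annotation files are needed here.
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            stem = member.name[: -len(".json")]
            record = json.load(tar.extractfile(member))
            # Each entry looks like {"value": 3, "name": ..., "label": ...}.
            annotations[stem] = {code: entry["value"] for code, entry in record.items()}
    return annotations


def score(predictions, truth):
    """Exact accuracy, Adj1, and mean absolute difference.

    Assumption: all (sample, dimension) pairs present in both inputs are
    pooled into a single average.
    """
    diffs = [
        abs(predictions[stem][code] - value)
        for stem, dims in truth.items()
        if stem in predictions
        for code, value in dims.items()
        if code in predictions[stem]
    ]
    if not diffs:
        raise ValueError("no overlapping (sample, dimension) pairs to score")
    n = len(diffs)
    return {
        "exact": sum(d == 0 for d in diffs) / n,   # prediction == ground truth
        "adj1": sum(d <= 1 for d in diffs) / n,    # within ±1 of ground truth
        "mean_diff": sum(diffs) / n,               # average |prediction - truth|
    }
```

With the download path used in the Evaluation section, `truth = load_annotations("val/voice_taxonomy_val.tar")` would give the reference labels, and `score(predictions, truth)` expects `predictions` in the same `{stem: {code: value}}` shape.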

Provided by:

TTS-AGI
