TTS-AGI/voice-taxonomy-flash-train

Name: TTS-AGI/voice-taxonomy-flash-train
Creator: TTS-AGI
Published: 2026-04-08 07:06:05
License: cc-by-4.0 (as declared in the dataset card metadata)

Source: Hugging Face, updated 2026-04-08, indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/TTS-AGI/voice-taxonomy-flash-train

Description:

---
license: cc-by-4.0
task_categories:
- audio-classification
tags:
- voice
- speech
- taxonomy
- whisper
- gemini
- tts
- voice-attributes
size_categories:
- 10K<n<100K
---

# Voice Taxonomy Fine-tuning Dataset (Gemini Flash)

**36,641 speech samples** annotated with **57 voice taxonomy dimensions** (0-6 ordinal scale) by **Gemini Flash**. Carefully balanced (~100 samples per bucket per dimension) for fine-tuning voice attribute classifiers.

## Related Datasets

| Dataset | Purpose | Link |
|---------|---------|------|
| Pre-training (large, Whisper ensemble) | Pre-training | [TTS-AGI/voice-taxonomy-pretrain](https://huggingface.co/datasets/TTS-AGI/voice-taxonomy-pretrain) |
| **This dataset** | Fine-tuning (balanced, high-quality) | n/a |
| Validation (Gemini 3.1 Pro gold) | Evaluation | [TTS-AGI/voice-taxonomy-val](https://huggingface.co/datasets/TTS-AGI/voice-taxonomy-val) |

## Format

WebDataset TAR with MP3+JSON pairs:

```
{stem}.mp3   # Audio (mono, 44.1kHz, 64kbps, ≤30s)
{stem}.json  # 57-dim taxonomy annotation
```

Each JSON:

```json
{
  "AGEV": {"value": 3, "name": "Perceived Age", "label": "young adult"},
  "GEND": {"value": 5, "name": "Gender Presentation", "label": "standard masculine"},
  "TEMP": {"value": 4, "name": "Tempo", "label": "slightly fast energetic"},
  ...
}
```

## Training Plan

See [TRAINING_PLAN.md](TRAINING_PLAN.md) for the full training strategy and `train_voice_taxonomy.py` for a self-contained training script.

## Quick Start

```bash
# Download all 3 datasets (--repo-type dataset is required; the CLI defaults to model repos)
huggingface-cli download TTS-AGI/voice-taxonomy-pretrain --repo-type dataset --local-dir pretrain
huggingface-cli download TTS-AGI/voice-taxonomy-flash-train --repo-type dataset --local-dir finetune
huggingface-cli download TTS-AGI/voice-taxonomy-val --repo-type dataset --local-dir val

# Fine-tune (after pre-training)
python train_voice_taxonomy.py --phase finetune --encoder laion/BUD-E-Whisper --gpu 0 \
  --resume checkpoints/pretrain_best.pt \
  --finetune-tar finetune/voice_taxonomy_flash_train.tar \
  --val-tar val/voice_taxonomy_val.tar
```

## Balancing Strategy

Samples were selected to maximize coverage across all 57 × 7 = 399 buckets:

- Up to 100 samples per bucket per dimension
- Files deduplicated across dimensions
- Validation set files excluded
- Total: 36,641 unique files from 318K candidates

## Labels

Labels were generated by **Gemini 2.0 Flash** via multimodal audio annotation, using a detailed system prompt that covers all 57 dimensions. The prompt includes anti-center-bias instructions intended to spread labels across the full 0-6 scale rather than clustering them around the middle.

## Taxonomy

57 dimensions covering: speaker identity, timbral quality, resonance placement, prosody, articulation, emotion, and speaking style. See `taxonomy_labels.json` for full definitions.
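
The selection rules listed under Balancing Strategy can be sketched as a simple capped bucket fill. This is one possible reading, not the authors' actual procedure: the function name `select_balanced` and its parameters are hypothetical, and it assumes each (dimension, value) bucket independently keeps its first `cap` candidates, with the final set being the union of all buckets (which is what deduplicates files across dimensions).

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple


def select_balanced(
    annotations: Dict[str, Dict[str, int]],
    cap: int = 100,
    exclude: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Pick up to `cap` files per (dimension, value) bucket, then deduplicate.

    annotations maps file id -> {dimension code: 0-6 value}.
    exclude holds file ids to skip, e.g. validation-set files.
    """
    buckets: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for file_id, dims in annotations.items():
        if file_id in exclude:
            continue
        for dim, value in dims.items():
            bucket = buckets[(dim, value)]
            if len(bucket) < cap:  # bucket still has room
                bucket.append(file_id)

    selected: Set[str] = set()
    for members in buckets.values():
        selected.update(members)  # the union removes cross-dimension duplicates
    return selected


# Toy candidates with two dimensions; "v" stands in for a validation file.
candidates = {
    "a": {"TEMP": 4, "AGEV": 3},
    "b": {"TEMP": 4, "AGEV": 3},
    "c": {"TEMP": 4, "AGEV": 5},
    "v": {"TEMP": 0, "AGEV": 0},
}
print(sorted(select_balanced(candidates, cap=1, exclude=frozenset({"v"}))))
# → ['a', 'c']
```

With 399 buckets and a cap of 100, at most 39,900 picks are possible before deduplication, which is consistent with the 36,641 unique files reported above.
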

Provider:

TTS-AGI
