reference_ai_voices_with_timbre_annotations
ModelScope community (魔搭社区). Updated 2025-12-05; indexed 2025-12-06.
Download link:
https://modelscope.cn/datasets/laion/reference_ai_voices_with_timbre_annotations
Resource description:
## Overview
`reference_voice_dataset__mp3` is a collection of around **32,000 AI-generated voice samples**.
The clips were created with different neural TTS / voice-conversion tools and are designed to
cover a **broad emotional spectrum** and a **wide variety of vocal timbres and character types**.
Voices span:
- Young adults to elderly speakers
- Masculine, feminine, and androgynous presentations
- Dark vs. bright, soft vs. harsh, warm vs. cool timbres
- Neutral, everyday voices and highly stylised character voices, including
fairy-like characters, evil overlords, dragon kings, princesses, and more
The core goal of this dataset is to provide **rich training material for timbre modeling and controllable TTS**.
## Contents
This repository currently hosts a single archive:
- `reference_voice_dataset__mp3.tar`
  - Contains ~32k short MP3 clips (one voice per clip).
  - Each clip is paired with a JSON sidecar file (same basename + `.json`) containing
    a `timbre_annotation` object produced by an automated annotation script
    (see description below).
The `timbre_annotation` JSON structure includes:
- `trait_tags`
  Discrete labels for **stable vocal traits**, for example:
  - perceived age and gender expression
  - body size impression
  - pitch level
  - timbre brightness (dark ↔ bright)
  - timbre softness/harshness
  - timbre warmth (warm ↔ cool)
  - timbre clarity/roughness
  - nasality
  - breathiness
  - phonation type (modal, breathy, creaky, pressed, etc.)
  - resonance placement (chest / head / mask)
  - vocal health (clear ↔ hoarse/strained/smoky)
  - baseline tension (relaxed ↔ very tense)
  - articulation clarity (very clear ↔ heavily mumbled)
  - baseline speech rate (very slow ↔ very fast)
  - accent region (e.g. neutral, US/UK regional, European, African, Asian accents, etc.)
  - language register (very casual ↔ very formal)
- `context_tags`
  Three lists of role tags, each in snake_case:
  - `fantasy`: roles such as `gentle_elf_healer`, `aged_demon_lord`, `young_fairy_princess`, etc.
  - `science_fiction`: roles such as `warm_starship_ai`, `ruthless_space_pirate`, `calm_mission_controller`, etc.
  - `contemporary`: roles such as `kind_elementary_teacher`, `stern_officer`, `sarcastic_office_coworker`, etc.
- `trait_caption`
  2–4 sentences describing the **stable voice identity** (age impression, gender expression,
  timbre, accent, speech rate, articulation, etc.), without referring to specific roles.
- `casting_caption`
  2–4 sentences describing **which kinds of characters** this voice is a good fit for across
  fantasy, science-fiction, and contemporary settings.
- `listening_pleasantness`
  A 5-level label from `very_unpleasant` to `very_pleasant`, describing how enjoyable the voice is
  to listen to.
- `voice_commonness`
  A 3-level label from `common_voice` to `very_unusual_voice`, describing how typical or
  distinctive the voice is in everyday life.
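
The fields listed above can be read into a small typed container. The following is a minimal sketch, not the schema used by the annotation script: `TimbreAnnotation` and `parse_timbre_annotation` are hypothetical names, the layout of `trait_tags` is treated as opaque because its exact keys are not spelled out here, and `context_tags` is assumed to be an object with the three list-valued keys named above.

```python
from dataclasses import dataclass
from typing import Any

# Genre keys named in the dataset description.
CONTEXT_GENRES = ("fantasy", "science_fiction", "contemporary")


@dataclass
class TimbreAnnotation:
    """Hypothetical container mirroring the documented top-level fields."""
    trait_tags: Any                     # layout not specified here; kept as-is
    context_tags: dict[str, list[str]]  # assumed: genre -> list of role tags
    trait_caption: str
    casting_caption: str
    listening_pleasantness: str         # "very_unpleasant" ... "very_pleasant"
    voice_commonness: str               # "common_voice" ... "very_unusual_voice"


def parse_timbre_annotation(obj: dict[str, Any]) -> TimbreAnnotation:
    """Build a TimbreAnnotation from a decoded `timbre_annotation` object."""
    context = obj.get("context_tags") or {}
    return TimbreAnnotation(
        trait_tags=obj.get("trait_tags"),
        # Missing genres become empty lists so callers can iterate safely.
        context_tags={g: list(context.get(g, [])) for g in CONTEXT_GENRES},
        trait_caption=obj.get("trait_caption", ""),
        casting_caption=obj.get("casting_caption", ""),
        listening_pleasantness=obj.get("listening_pleasantness", ""),
        voice_commonness=obj.get("voice_commonness", ""),
    )
```

Missing fields fall back to empty values rather than raising, which suits exploratory use; a stricter loader might reject incomplete annotations instead.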
Each annotation is generated automatically by a script similar to `annotate-timbre.py`,
using a structured-output large language model with a well-specified Pydantic schema.
The JSON example in the repository illustrates the full structure of a `timbre_annotation`.
moving to real human recordings.
## How to Use
1. Download and extract:
   ```bash
   git lfs install
   git clone https://huggingface.co/datasets/laion/reference_voice_dataset__mp3
   cd reference_voice_dataset__mp3
   tar -xf reference_voice_dataset__mp3.tar
   ```
2. Each MP3 file `<basename>.mp3` should have a corresponding `<basename>.json`
   file with a `timbre_annotation` object.

You can then parse the JSON into your own data structures.
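
Pairing each clip with its sidecar could look like the sketch below. It rests on a few assumptions: the archive has been extracted somewhere under a root directory, each sidecar stores the annotation under a top-level `timbre_annotation` key (the code falls back to the whole document if that key is absent), and the helper name `iter_voice_samples` is hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_voice_samples(root: str) -> Iterator[tuple[Path, Any]]:
    """Yield (mp3_path, annotation) for every clip that has a JSON sidecar."""
    for mp3_path in sorted(Path(root).rglob("*.mp3")):
        sidecar = mp3_path.with_suffix(".json")  # same basename + .json
        if not sidecar.is_file():
            continue  # skip clips without an annotation
        with sidecar.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Assumption: the annotation sits under a `timbre_annotation` key;
        # otherwise treat the whole sidecar as the annotation.
        if isinstance(data, dict) and "timbre_annotation" in data:
            annotation = data["timbre_annotation"]
        else:
            annotation = data
        yield mp3_path, annotation
```

For instance, iterating over `iter_voice_samples(".")` and keeping entries whose annotation has `listening_pleasantness` equal to `very_pleasant` would select the clips carrying the top pleasantness label.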
Provider:
maas
Created:
2025-12-01



