reference_ai_voices_with_timbre_annotations

Name: reference_ai_voices_with_timbre_annotations
Creator: maas
Published: 2025-12-05 16:57:45
License: no description available

ModelScope Community. Updated 2025-12-05; indexed 2025-12-06.

Download link:

https://modelscope.cn/datasets/laion/reference_ai_voices_with_timbre_annotations



Resource description:

## Overview

`reference_voice_dataset__mp3` is a collection of around **32,000 AI-generated voice samples**. The clips were created with different neural TTS / voice-conversion tools and are designed to cover a **broad emotional spectrum** and a **wide variety of vocal timbres and character types**.

Voices span:

- Young adults to elderly speakers
- Masculine, feminine, and androgynous presentations
- Dark vs. bright, soft vs. harsh, warm vs. cool timbres
- Neutral, everyday voices and highly stylised character voices, including fairy-like characters, evil overlords, dragon kings, princesses, and more

The core goal of this dataset is to provide **rich training material for timbre modeling and controllable TTS**.

## Contents

This repository currently hosts a single archive:

- `reference_voice_dataset__mp3.tar`
  - Contains ~32k short MP3 clips (one voice per clip).
  - Each clip is paired with a JSON sidecar file (same basename + `.json`) containing a `timbre_annotation` object produced by an automated annotation script (see description below).

The `timbre_annotation` JSON structure includes:

- `trait_tags`
  Discrete labels for **stable vocal traits**, for example:
  - perceived age and gender expression
  - body size impression
  - pitch level
  - timbre brightness (dark ↔ bright)
  - timbre softness/harshness
  - timbre warmth (warm ↔ cool)
  - timbre clarity/roughness
  - nasality
  - breathiness
  - phonation type (modal, breathy, creaky, pressed, etc.)
  - resonance placement (chest / head / mask)
  - vocal health (clear ↔ hoarse/strained/smoky)
  - baseline tension (relaxed ↔ very tense)
  - articulation clarity (very clear ↔ heavily mumbled)
  - baseline speech rate (very slow ↔ very fast)
  - accent region (e.g. neutral, US/UK regional, European, African, Asian accents, etc.)
  - language register (very casual ↔ very formal)
- `context_tags`
  Three lists of role tags, each in snake_case:
  - `fantasy`: roles such as `gentle_elf_healer`, `aged_demon_lord`, `young_fairy_princess`, etc.
  - `science_fiction`: roles such as `warm_starship_ai`, `ruthless_space_pirate`, `calm_mission_controller`, etc.
  - `contemporary`: roles such as `kind_elementary_teacher`, `stern_officer`, `sarcastic_office_coworker`, etc.
- `trait_caption`
  2–4 sentences describing the **stable voice identity** (age impression, gender expression, timbre, accent, speech rate, articulation, etc.), without referring to specific roles.
- `casting_caption`
  2–4 sentences describing **which kinds of characters** this voice is a good fit for across fantasy, science-fiction, and contemporary settings.
- `listening_pleasantness`
  A 5-level label from `very_unpleasant` to `very_pleasant`, describing how enjoyable the voice is to listen to.
- `voice_commonness`
  A 3-level label from `common_voice` to `very_unusual_voice`, describing how typical or distinctive the voice is in everyday life.

Each annotation is generated automatically by a script similar to `annotate-timbre.py`, using a structured-output large language model with a well-specified Pydantic schema. The JSON example in the repository illustrates the full structure of a `timbre_annotation`.

… moving to real human recordings.

## How to Use

1. Download and extract:

   ```bash
   git lfs install
   git clone https://huggingface.co/datasets/laion/reference_voice_dataset__mp3
   cd reference_voice_dataset__mp3
   tar -xf reference_voice_dataset__mp3.tar
   ```

2. Each MP3 file `<basename>.mp3` should have a corresponding `<basename>.json` file with a `timbre_annotation` object. You can then:

   - Parse the JSON into your own data structures,
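The pairing of clips and sidecars described in step 2 can be sketched in Python as follows. This is a minimal illustration, not code from the repository: the helper name `iter_annotated_clips` is hypothetical, and it assumes that each sidecar is a JSON object with a top-level `timbre_annotation` key and that the extracted files sit somewhere under the directory you pass in.

```python
import json
from pathlib import Path


def iter_annotated_clips(root):
    """Yield (mp3_path, timbre_annotation) for every clip that has a usable sidecar.

    Hypothetical helper; the directory layout inside the archive is not
    documented here, so the search is recursive.
    """
    for mp3_path in sorted(Path(root).rglob("*.mp3")):
        # The sidecar shares the clip's basename and uses the .json suffix.
        json_path = mp3_path.with_suffix(".json")
        if not json_path.is_file():
            continue  # clip without a sidecar: skip it
        with json_path.open(encoding="utf-8") as fh:
            sidecar = json.load(fh)
        # Assumption: the annotation lives under a top-level "timbre_annotation" key.
        annotation = sidecar.get("timbre_annotation")
        if annotation is not None:
            yield mp3_path, annotation
```

Calling `list(iter_annotated_clips("."))` from the extracted directory would give a list of path/annotation pairs that can be fed into whatever data structures your training code uses.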

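Once the annotations are loaded as Python dictionaries, the label fields described above lend themselves to simple filtering and counting. The sketch below is illustrative only: the function names `label_counts` and `with_role` are hypothetical, and it assumes that `context_tags` is an object whose `fantasy`, `science_fiction`, and `contemporary` keys each hold a list of role strings, and that `listening_pleasantness` and `voice_commonness` are plain string fields.

```python
from collections import Counter


def label_counts(annotations, field):
    """Count how often each value of a top-level label field occurs."""
    return Counter(a[field] for a in annotations if field in a)


def with_role(annotations, setting, role):
    """Keep annotations whose context_tags[setting] list contains the role tag.

    Assumption: context_tags maps a setting name to a list of snake_case roles.
    """
    return [
        a for a in annotations
        if role in a.get("context_tags", {}).get(setting, [])
    ]
```

For example, `label_counts(annotations, "voice_commonness")` would summarise how the three commonness levels are distributed, and `with_role(annotations, "fantasy", "gentle_elf_healer")` would select voices tagged for that role.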

Provider:

maas

Created:

2025-12-01
