
reference_ai_voices_with_timbre_annotations

ModelScope Community · Updated 2025-12-05 · Indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/laion/reference_ai_voices_with_timbre_annotations
Resource description:
## Overview

`reference_voice_dataset__mp3` is a collection of around **32,000 AI-generated voice samples**. The clips were created with different neural TTS / voice-conversion tools and are designed to cover a **broad emotional spectrum** and a **wide variety of vocal timbres and character types**.

Voices span:

- Young adults to elderly speakers
- Masculine, feminine, and androgynous presentations
- Dark vs. bright, soft vs. harsh, warm vs. cool timbres
- Neutral, everyday voices and highly stylised character voices, including fairy-like characters, evil overlords, dragon kings, princesses, and more

The core goal of this dataset is to provide **rich training material for timbre modeling and controllable TTS**.

## Contents

This repository currently hosts a single archive:

- `reference_voice_dataset__mp3.tar`
  - Contains ~32k short MP3 clips (one voice per clip).
  - Each clip is paired with a JSON sidecar file (same basename + `.json`) containing a `timbre_annotation` object produced by an automated annotation script (see description below).

The `timbre_annotation` JSON structure includes:

- `trait_tags`
  Discrete labels for **stable vocal traits**, for example:
  - perceived age and gender expression
  - body size impression
  - pitch level
  - timbre brightness (dark ↔ bright)
  - timbre softness/harshness
  - timbre warmth (warm ↔ cool)
  - timbre clarity/roughness
  - nasality
  - breathiness
  - phonation type (modal, breathy, creaky, pressed, etc.)
  - resonance placement (chest / head / mask)
  - vocal health (clear ↔ hoarse/strained/smoky)
  - baseline tension (relaxed ↔ very tense)
  - articulation clarity (very clear ↔ heavily mumbled)
  - baseline speech rate (very slow ↔ very fast)
  - accent region (e.g. neutral, US/UK regional, European, African, Asian accents, etc.)
  - language register (very casual ↔ very formal)
- `context_tags`
  Three lists of role tags, each in snake_case:
  - `fantasy`: roles such as `gentle_elf_healer`, `aged_demon_lord`, `young_fairy_princess`, etc.
  - `science_fiction`: roles such as `warm_starship_ai`, `ruthless_space_pirate`, `calm_mission_controller`, etc.
  - `contemporary`: roles such as `kind_elementary_teacher`, `stern_officer`, `sarcastic_office_coworker`, etc.
- `trait_caption`
  2–4 sentences describing the **stable voice identity** (age impression, gender expression, timbre, accent, speech rate, articulation, etc.), without referring to specific roles.
- `casting_caption`
  2–4 sentences describing **which kinds of characters** this voice is a good fit for across fantasy, science-fiction, and contemporary settings.
- `listening_pleasantness`
  A 5-level label from `very_unpleasant` to `very_pleasant`, describing how enjoyable the voice is to listen to.
- `voice_commonness`
  A 3-level label from `common_voice` to `very_unusual_voice`, describing how typical or distinctive the voice is in everyday life.

Each annotation is generated automatically by a script similar to `annotate-timbre.py`, using a structured-output large language model with a well-specified Pydantic schema. The JSON example in the repository illustrates the full structure of a `timbre_annotation`.

... moving to real human recordings.

## How to Use

1. Download and extract:

   ```bash
   git lfs install
   git clone https://huggingface.co/datasets/laion/reference_voice_dataset__mp3
   cd reference_voice_dataset__mp3
   tar -xf reference_voice_dataset__mp3.tar
   ```

2. Each MP3 file `<basename>.mp3` should have a corresponding `<basename>.json` file with a `timbre_annotation` object. You can then:
   - Parse the JSON into your own data structures
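
As a rough illustration of the MP3/JSON pairing described above, the following Python sketch walks an extracted directory and yields each clip together with its annotation. It rests on some assumptions not confirmed by the dataset card: the function name `iter_annotated_clips` is hypothetical, the directory layout inside the archive is unknown (so the search is recursive), and the annotation is assumed to sit under a top-level `timbre_annotation` key in the sidecar.

```python
import json
from pathlib import Path


def iter_annotated_clips(root):
    """Yield (mp3_path, annotation) for every MP3 under `root` that has a JSON sidecar.

    Hypothetical helper: the archive layout is not documented, so the search
    is recursive and clips without a sidecar are skipped.
    """
    for mp3_path in sorted(Path(root).rglob("*.mp3")):
        # Same basename, `.json` extension, as stated in the dataset card.
        json_path = mp3_path.with_suffix(".json")
        if not json_path.is_file():
            continue
        with json_path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Assumption: the annotation is stored under a top-level
        # "timbre_annotation" key; otherwise treat the whole object as the annotation.
        if isinstance(data, dict) and "timbre_annotation" in data:
            annotation = data["timbre_annotation"]
        else:
            annotation = data
        yield mp3_path, annotation
```

Calling `iter_annotated_clips(".")` from inside the extracted repository directory would then iterate over all paired clips. Yielding lazily keeps memory use flat across the roughly 32k files.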

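
The fields of a `timbre_annotation` object could be mapped onto a small typed structure for filtering, for example to find voices tagged for a given kind of role. This is only a sketch under stated assumptions: the class `TimbreAnnotation` and the helper `matches_role` are hypothetical names, the internal shape of `trait_tags` is not specified in the description and is therefore kept as raw data, and the example dictionary is hand-written from the tag names quoted above rather than taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TimbreAnnotation:
    """Hypothetical typed view of one `timbre_annotation` object."""

    trait_tags: Any = None  # structure not documented here; kept as-is
    context_tags: dict = field(default_factory=dict)
    trait_caption: str = ""
    casting_caption: str = ""
    listening_pleasantness: str = ""
    voice_commonness: str = ""

    @classmethod
    def from_dict(cls, raw):
        # Missing keys fall back to empty defaults instead of raising.
        return cls(
            trait_tags=raw.get("trait_tags"),
            context_tags=dict(raw.get("context_tags") or {}),
            trait_caption=raw.get("trait_caption", ""),
            casting_caption=raw.get("casting_caption", ""),
            listening_pleasantness=raw.get("listening_pleasantness", ""),
            voice_commonness=raw.get("voice_commonness", ""),
        )

    def roles(self, setting):
        """Return the role tags for `fantasy`, `science_fiction`, or `contemporary`."""
        return list(self.context_tags.get(setting) or [])


def matches_role(annotation, setting, keyword):
    """True if any snake_case role tag in `setting` contains `keyword`."""
    return any(keyword in role for role in annotation.roles(setting))


# Hand-written illustration, not a record from the dataset.
example = TimbreAnnotation.from_dict(
    {
        "context_tags": {"fantasy": ["gentle_elf_healer", "young_fairy_princess"]},
        "listening_pleasantness": "very_pleasant",
    }
)
```

With this in place, `matches_role(example, "fantasy", "elf")` selects the example voice, while the same keyword under `"science_fiction"` does not. Substring matching on the snake_case tags is a deliberately simple choice; exact-match or token-based matching may suit stricter filtering.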
Provider: maas
Created: 2025-12-01