SpeechPPL/SALMon_Spirit-LM-Expressive-normalized
Source: Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_Spirit-LM-Expressive-normalized
Resource description:
---
configs:
- config_name: bg_alignment
  data_files:
  - split: train
    path: bg_alignment/train-*
- config_name: bg_all_consistency
  data_files:
  - split: train
    path: bg_all_consistency/train-*
- config_name: bg_domain_consistency
  data_files:
  - split: train
    path: bg_domain_consistency/train-*
- config_name: gender_consistency
  data_files:
  - split: train
    path: gender_consistency/train-*
- config_name: rir_consistency
  data_files:
  - split: train
    path: rir_consistency/train-*
- config_name: sentiment_alignment
  data_files:
  - split: train
    path: sentiment_alignment/train-*
- config_name: sentiment_consistency
  data_files:
  - split: train
    path: sentiment_consistency/train-*
- config_name: speaker_consistency
  data_files:
  - split: train
    path: speaker_consistency/train-*
dataset_info:
- config_name: bg_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 86708136
    num_examples: 200
  download_size: 86708136
  dataset_size: 86708136
- config_name: bg_all_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  splits:
  - name: train
    num_bytes: 222443312
    num_examples: 200
  download_size: 222443312
  dataset_size: 222443312
- config_name: bg_domain_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  splits:
  - name: train
    num_bytes: 226172124
    num_examples: 200
  download_size: 226172124
  dataset_size: 226172124
- config_name: gender_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  splits:
  - name: train
    num_bytes: 228058502
    num_examples: 200
  download_size: 228058502
  dataset_size: 228058502
- config_name: rir_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  splits:
  - name: train
    num_bytes: 202444443
    num_examples: 200
  download_size: 202444443
  dataset_size: 202444443
- config_name: sentiment_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 46555074
    num_examples: 200
  download_size: 46555074
  dataset_size: 46555074
- config_name: sentiment_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  splits:
  - name: train
    num_bytes: 223684769
    num_examples: 200
  download_size: 223684769
  dataset_size: 223684769
- config_name: speaker_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
    - name: pitch
      dtype: string
    - name: style
      dtype: string
  splits:
  - name: train
    num_bytes: 228183961
    num_examples: 200
  download_size: 228183961
  dataset_size: 228183961
---
# SALMon Normalized Dataset
This repo preserves the SALMon per-config folder layout while normalizing
mismatched schema details across model families.
Provider:
SpeechPPL
Collected summary
Dataset introduction

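Construction
In speech synthesis and speaker modeling, it matters how well a model controls acoustic attributes, and SALMon_Spirit-LM-Expressive-normalized is built to evaluate that. It derives from the original SALMon dataset and adds a normalization step: it keeps SALMon's per-config folder layout and unifies the schema details that differ across model families, so the data structure is consistent. Each config (bg_alignment, speaker_consistency, and so on) contains 200 examples in a single train split. Each example provides the positive and negative audio pair, the corresponding raw units (hubert, pitch, style), and a model-generated continuation audio clip, along with other structured fields used for fine-grained evaluation.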
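The shared schema can be spot-checked before running an evaluation. The sketch below copies the field names from the card's front matter; the helper names `expected_fields` and `missing_fields` are hypothetical, and the rule that `*_alignment` configs carry the smaller field set while `*_consistency` configs carry the extra fields is an observation from the eight configs listed there, not a documented guarantee.

```python
# Field names copied from the dataset card's front matter.
COMMON_FIELDS = frozenset({
    "task", "ind", "positive_audio", "negative_audio", "prompt_audio",
    "continuation_audio_positive", "continuation_audio_negative",
    "negative_audio_sanity", "positive_sample_tokenwise_loss",
    "negative_sample_tokenwise_loss", "code_frame_rate", "code_depth",
    "model_sampling_rate", "ppl_sanity", "positive_sample_raw_units",
    "negative_sample_raw_units", "model_generated_continuation",
})

# Extra fields that appear only in the *_consistency configs.
CONSISTENCY_ONLY_FIELDS = frozenset({
    "audio_transition_s", "positive_continuation_tokenwise_loss",
    "negative_continuation_tokenwise_loss", "prompt_sample_tokenwise_loss",
    "positive_continuation_raw_units", "negative_continuation_raw_units",
})


def expected_fields(config_name: str) -> frozenset:
    """Return the field set expected for a config (assumed suffix rule)."""
    if config_name.endswith("_alignment"):
        return COMMON_FIELDS
    if config_name.endswith("_consistency"):
        return COMMON_FIELDS | CONSISTENCY_ONLY_FIELDS
    raise ValueError(f"unknown config family: {config_name!r}")


def missing_fields(config_name: str, columns) -> list:
    """List expected fields that are absent from `columns`, sorted by name."""
    return sorted(expected_fields(config_name) - set(columns))


print(len(expected_fields("bg_alignment")))
print(len(expected_fields("speaker_consistency")))
```

With a loaded split, `missing_fields(config_name, ds.column_names)` should return an empty list if the normalization holds for that config.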
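Features
The dataset's distinguishing feature is its fine-grained coverage of expressive speech control. It defines eight independent configs: background alignment (bg_alignment), sentiment alignment (sentiment_alignment), and six consistency tasks covering all-background, background-domain, gender, room impulse response (rir), sentiment, and speaker consistency. Every example includes token-wise losses for the positive and negative samples together with their raw units, so researchers can analyze how well a model preserves speaker identity, sentiment, pitch, and similar attributes. The prompt, continuation, sanity-check, and model-generated audio fields declare a 16 kHz sampling rate in every config, which keeps evaluation comparable across configs; the positive_audio and negative_audio fields do not declare a rate in the card.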
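One way to use the paired token-wise losses is a preference accuracy: the fraction of examples where the model assigns lower loss to the positive sample than to the negative one. The sketch below is a minimal illustration on toy rows, not the dataset's official scoring code. Averaging the per-token losses (rather than summing them) is an assumption made here to length-normalize the comparison, and the function names are hypothetical.

```python
from statistics import fmean
from typing import Iterable, Mapping, Sequence


def mean_loss(token_losses: Sequence[float]) -> float:
    """Average per-token loss (length-normalized; an assumption of this sketch)."""
    if not token_losses:
        raise ValueError("empty token loss sequence")
    return fmean(token_losses)


def preference_accuracy(rows: Iterable[Mapping[str, Sequence[float]]]) -> float:
    """Fraction of rows whose positive sample has lower mean loss than the negative."""
    wins = 0
    total = 0
    for row in rows:
        pos = mean_loss(row["positive_sample_tokenwise_loss"])
        neg = mean_loss(row["negative_sample_tokenwise_loss"])
        wins += pos < neg  # True counts as 1
        total += 1
    if total == 0:
        raise ValueError("no rows to score")
    return wins / total


# Toy rows that mimic the two loss fields; the values are made up for illustration.
toy_rows = [
    {"positive_sample_tokenwise_loss": [1.0, 2.0],
     "negative_sample_tokenwise_loss": [2.0, 3.0]},
    {"positive_sample_tokenwise_loss": [4.0, 4.0],
     "negative_sample_tokenwise_loss": [1.0, 1.0, 1.0]},
    {"positive_sample_tokenwise_loss": [0.5],
     "negative_sample_tokenwise_loss": [0.75]},
    {"positive_sample_tokenwise_loss": [2.0, 2.0],
     "negative_sample_tokenwise_loss": [3.0, 5.0]},
]
accuracy = preference_accuracy(toy_rows)
print(accuracy)  # 3 of 4 toy rows prefer the positive sample -> 0.75
```

The same function accepts rows from a loaded split, since each row exposes both loss fields as sequences of floats.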
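Usage
The dataset can be loaded with the Hugging Face Datasets library by passing the repository ID and the target config name, for example `load_dataset('SpeechPPL/SALMon_Spirit-LM-Expressive-normalized', 'speaker_consistency', split='train')`. In each example, `positive_audio` and `negative_audio` hold the clips to compare, while `prompt_audio` and its positive and negative continuation audio support evaluation of conditional generation. Loss fields such as `positive_sample_tokenwise_loss` can be used to compute preference accuracy per attribute, giving a systematic measure of how robustly a speech language model handles expressive control.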
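The loading call described above can be wrapped in a small helper that checks the config name first. This is a sketch under stated assumptions: `load_config` and the constants are hypothetical names, the repository ID is taken from the page title and download link, and the `datasets` package must be installed with network access available for the actual download.

```python
# Repository ID as shown in the page title and download link.
REPO_ID = "SpeechPPL/SALMon_Spirit-LM-Expressive-normalized"

# The eight config names listed in the card's front matter.
CONFIGS = (
    "bg_alignment",
    "bg_all_consistency",
    "bg_domain_consistency",
    "gender_consistency",
    "rir_consistency",
    "sentiment_alignment",
    "sentiment_consistency",
    "speaker_consistency",
)


def load_config(config_name: str, split: str = "train"):
    """Load one config of the dataset after validating its name."""
    if config_name not in CONFIGS:
        raise ValueError(
            f"unknown config {config_name!r}; expected one of {CONFIGS}"
        )
    # Imported lazily so the name check works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)


if __name__ == "__main__":
    ds = load_config("speaker_consistency")
    print(ds.num_rows, ds.column_names)
```

Each config has only a `train` split, so the default `split` argument covers every case listed in the card.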
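Background and challenges
Background overview
SALMon_Spirit-LM-Expressive-normalized was built to evaluate and improve the consistency of speech language models along several expressive dimensions. It is based on the SALMon framework and uses fine-grained task configs (sentiment alignment, speaker consistency, and others) to probe model behavior through raw units for HuBERT, pitch, and style. The central research question is whether current generative speech models, when continuing a speech prompt, keep attributes such as background sound, sentiment, gender, and room impulse response stable and aligned. As an evaluation resource for the Spirit-LM family, the dataset offers a standardized reference for analyzing where speech models fall short in expressive control.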
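Current challenges
The challenges fall into two groups. At the domain level, speech models often fail to disentangle semantic content from paralinguistic features: generated speech changes abruptly in, or loses, attributes such as sentiment, gender, or background sound, which makes it hard to balance rich content with consistent expression. At the construction level, positive and negative sample pairs must be built for several acoustic representations (HuBERT units, pitch, style) while the data format stays uniform across tasks, which places strict demands on annotation precision and cross-modal alignment. In addition, each config holds only 200 examples, so per-config results rest on a small sample.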
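Common scenarios
Classic use cases
The most typical use of SALMon_Spirit-LM-Expressive-normalized is to evaluate how well a speech language model maintains expressive consistency. The dataset defines several sub-task configs, including background alignment, sentiment consistency, speaker consistency, and gender consistency, and each config contains positive and negative sample pairs with their audio and token-level losses. Researchers can therefore test systematically whether a model that continues a speech prompt stays consistent with the prompt audio in voice, sentiment, background noise, and related dimensions, which makes the dataset a benchmark for measuring and improving expressive robustness.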
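Derived and related work
The dataset has motivated research focused specifically on expressive consistency in speech language models. Work built around its consistency tasks has proposed model improvements such as explicit speaker-embedding conditioning and contrastive-learning losses, aimed at preserving voice and sentiment characteristics over long generations. The dataset is also used as an evaluation benchmark in work on discrete audio representations and controllable speech generation. Together, these efforts push speech generation from merely producing speech toward producing it well and stably, and deepen the understanding of coherence in spoken expression.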
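Recent research on the dataset
Latest research directions
SALMon_Spirit-LM-Expressive-normalized reflects current work in neural audio generation on fine-grained expressive control and acoustic consistency. By separating acoustic factors such as background noise, domain, gender, reverberation, and sentiment, it provides a set of positive and negative sample pairs for alignment and consistency evaluation, and serves as a benchmark for the robustness and controllability of speech generation models. Its multi-dimensional consistency tasks, together with the joint representation of HuBERT discrete units, pitch, and style, support research on keeping sentiment coherent, speaker identity stable, and the acoustic environment realistic, and they inform how the naturalness and credibility of synthesized speech are evaluated.
The content above was collected and summarized by 遇见数据集.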



