SpeechPPL/SALMon_Spirit-LM-Expressive-normalized

Name: SpeechPPL/SALMon_Spirit-LM-Expressive-normalized
Creator: SpeechPPL
Published: 2026-04-10 14:18:33
License: No description available

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/SpeechPPL/SALMon_Spirit-LM-Expressive-normalized

Resource overview:

Configs (eight in total; each has a single train split of 200 examples, with data files at <config_name>/train-*; num_bytes, download_size, and dataset_size are identical within each config):

- bg_alignment: 86708136 bytes
- bg_all_consistency: 222443312 bytes
- bg_domain_consistency: 226172124 bytes
- gender_consistency: 228058502 bytes
- rir_consistency: 202444443 bytes
- sentiment_alignment: 46555074 bytes
- sentiment_consistency: 223684769 bytes
- speaker_consistency: 228183961 bytes

Features present in every config:

- task: string
- ind: int64
- positive_audio: audio
- negative_audio: audio
- prompt_audio: audio (sampling_rate: 16000)
- continuation_audio_positive: audio (sampling_rate: 16000)
- continuation_audio_negative: audio (sampling_rate: 16000)
- negative_audio_sanity: audio (sampling_rate: 16000)
- positive_sample_tokenwise_loss: sequence of float32
- negative_sample_tokenwise_loss: sequence of float32
- model_generated_continuation: audio (sampling_rate: 16000)
- code_frame_rate: int64
- code_depth: int64
- model_sampling_rate: int64
- ppl_sanity: int64
- positive_sample_raw_units: list of records (hubert: string, pitch: string, style: string)
- negative_sample_raw_units: list of records (hubert: string, pitch: string, style: string)

Additional features in the six consistency configs (bg_all_consistency, bg_domain_consistency, gender_consistency, rir_consistency, sentiment_consistency, speaker_consistency):

- audio_transition_s: int64
- positive_continuation_tokenwise_loss: sequence of float32
- negative_continuation_tokenwise_loss: sequence of float32
- prompt_sample_tokenwise_loss: sequence of float32
- positive_continuation_raw_units: list of records (hubert: string, pitch: string, style: string)
- negative_continuation_raw_units: list of records (hubert: string, pitch: string, style: string)

# SALMon Normalized Dataset

This repo preserves the SALMon per-config folder layout while normalizing mismatched schema details across model families.
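
The schema metadata above implies two column layouts: one shared by the two alignment configs and a wider one for the six consistency configs. As a rough illustration, the sketch below encodes both layouts so that a loaded config can be checked against them. The column names are copied from the card metadata; the helper names `expected_columns` and `missing_columns` are hypothetical and are not part of the dataset repo.

```python
# Column names below are copied from the dataset card metadata.
# The helper functions are hypothetical and not part of the dataset repo.
CONFIGS = [
    "bg_alignment", "bg_all_consistency", "bg_domain_consistency",
    "gender_consistency", "rir_consistency", "sentiment_alignment",
    "sentiment_consistency", "speaker_consistency",
]

COMMON_COLUMNS = [
    "task", "ind", "positive_audio", "negative_audio", "prompt_audio",
    "continuation_audio_positive", "continuation_audio_negative",
    "negative_audio_sanity", "positive_sample_tokenwise_loss",
    "negative_sample_tokenwise_loss", "model_generated_continuation",
    "code_frame_rate", "code_depth", "model_sampling_rate", "ppl_sanity",
    "positive_sample_raw_units", "negative_sample_raw_units",
]

CONSISTENCY_ONLY_COLUMNS = [
    "audio_transition_s", "positive_continuation_tokenwise_loss",
    "negative_continuation_tokenwise_loss", "prompt_sample_tokenwise_loss",
    "positive_continuation_raw_units", "negative_continuation_raw_units",
]


def expected_columns(config_name: str) -> set:
    """Return the column names the card metadata lists for one config."""
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config: {config_name}")
    columns = set(COMMON_COLUMNS)
    # Only the *_consistency configs carry the continuation-level fields.
    if config_name.endswith("_consistency"):
        columns |= set(CONSISTENCY_ONLY_COLUMNS)
    return columns


def missing_columns(config_name: str, actual_columns) -> list:
    """List expected columns that are absent from `actual_columns`."""
    return sorted(expected_columns(config_name) - set(actual_columns))
```

With a loaded split `ds`, `missing_columns("rir_consistency", ds.column_names)` would return an empty list if the data matches the card.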

Provider:

SpeechPPL

Collected summary

Dataset introduction

Construction

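In speech synthesis and speaker modeling, evaluating how well a model controls acoustic attributes is essential. The SALMon_Spirit-LM-Expressive-normalized dataset is built for this purpose: it derives from the original SALMon dataset, with a normalization step applied. Specifically, it keeps SALMon's per-config folder layout while unifying schema details that were mismatched across model families, so the data structure is consistent. Each config (bg_alignment, speaker_consistency, and so on) contains 200 training examples. Each example provides the audio for a positive/negative pair, the corresponding acoustic units (hubert, pitch, style), and a model-generated continuation audio clip, among other structured fields, which lays the groundwork for fine-grained evaluation.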

Features

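The dataset's core strength is its fine-grained coverage of expressive speech control. It defines eight independent configs: background noise alignment (bg_alignment), sentiment alignment (sentiment_alignment), and six consistency tasks (all-background, background-domain, gender, room impulse response (RIR), sentiment, and speaker consistency). Every example includes token-wise losses and raw acoustic units for the positive and negative samples, so researchers can analyze how well a model preserves attributes such as speaker identity, sentiment, and pitch. The prompt, continuation, sanity-check, and model-generated audio fields are declared at a 16 kHz sampling rate, which keeps evaluation standardized and comparable across configs.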
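
The token-wise loss fields can be condensed into one score per sample. A minimal sketch follows, assuming the stored values are per-token negative log-likelihoods in nats; the card does not state the unit or the log base, so this is an assumption. Under that assumption, perplexity is the exponential of the mean token loss, PPL = exp((1/T) Σ ℓ_t). The function names are illustrative only.

```python
import math
from typing import Sequence


def mean_loss(tokenwise_loss: Sequence[float]) -> float:
    """Average per-token loss for one sample."""
    if len(tokenwise_loss) == 0:
        raise ValueError("tokenwise_loss is empty")
    return sum(tokenwise_loss) / len(tokenwise_loss)


def perplexity(tokenwise_loss: Sequence[float]) -> float:
    """exp(mean loss); ASSUMES losses are natural-log negative log-likelihoods."""
    return math.exp(mean_loss(tokenwise_loss))
```

Averaging rather than summing keeps samples of different token lengths comparable, which matters when a positive and a negative sample differ in duration.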

Usage

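Researchers can load the dataset with the Hugging Face Datasets library by specifying the target config name, for example `load_dataset('SpeechPPL/SALMon_Spirit-LM-Expressive-normalized', 'speaker_consistency', split='train')`. In each example, the `positive_audio` and `negative_audio` fields provide the audio clips to compare, while `prompt_audio` and its positive and negative continuations support evaluation of the model's conditional generation. Loss fields such as `positive_sample_tokenwise_loss` can be used to compute preference accuracy for each attribute, giving a systematic measure of a speech language model's robustness in expressive control.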
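
The loading and scoring steps described above can be sketched as follows. This is an illustration under stated assumptions, not the dataset authors' evaluation script: the repo ID is taken from the download link, a pair counts as correct when the positive sample has the lower mean token loss (whether the reference metric uses a mean or a sum is not stated in the card), and the helper names are hypothetical.

```python
from statistics import fmean
from typing import Iterable, Mapping

# Repo ID taken from the download link on this page.
REPO_ID = "SpeechPPL/SALMon_Spirit-LM-Expressive-normalized"


def preference_accuracy(
    rows: Iterable[Mapping],
    pos_key: str = "positive_sample_tokenwise_loss",
    neg_key: str = "negative_sample_tokenwise_loss",
) -> float:
    """Fraction of rows whose positive sample has the lower mean token loss.

    ASSUMPTION: mean (not summed) loss is compared; the card does not say
    which aggregation the reference metric uses.
    """
    wins = 0
    total = 0
    for row in rows:
        total += 1
        if fmean(row[pos_key]) < fmean(row[neg_key]):
            wins += 1
    if total == 0:
        raise ValueError("no rows to score")
    return wins / total


def load_loss_columns(config_name: str):
    """Download one config and keep only the two sample-level loss columns.

    Needs the `datasets` package and network access. Dropping the audio
    columns avoids decoding audio when only the losses are required.
    """
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, config_name, split="train")
    return ds.select_columns(
        ["positive_sample_tokenwise_loss", "negative_sample_tokenwise_loss"]
    )
```

Calling `preference_accuracy(load_loss_columns("speaker_consistency"))` would score one config. For the consistency configs, passing `pos_key="positive_continuation_tokenwise_loss"` and `neg_key="negative_continuation_tokenwise_loss"` to `preference_accuracy` (with those columns selected) would restrict the comparison to the continuation segment.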

Background and challenges

Background overview

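The SALMon_Spirit-LM-Expressive-normalized dataset, published by SpeechPPL, is intended for evaluating and improving the consistency of speech language models along multiple expressive dimensions. Built on the SALMon framework, it uses fine-grained task configs (sentiment alignment, speaker consistency, and others) to probe model behavior systematically across acoustic units such as HuBERT, pitch, and style. The core research question is whether current generative speech models, when producing continuous speech, can keep attributes such as background sound, sentiment, gender, and room impulse response stable and aligned. As an evaluation resource for the Spirit-LM family, it offers a standardized reference point for analyzing shortcomings in expressive control and supports progress in controllable speech generation.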

Current challenges

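The challenges fall into two categories. At the domain level, speech models often fail to disentangle semantic content from paralinguistic features: the generated speech abruptly changes or loses attributes such as sentiment, gender, or background sound, so the model struggles to balance rich content with consistent expression. At the construction level, the designers have to build positive/negative sample pairs across several acoustic representations (HuBERT units, pitch, style) and keep the data format uniform across tasks, which places strict demands on annotation accuracy and cross-modal alignment. In addition, each config contains only 200 examples, which also tests how well a model generalizes under small-sample conditions.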
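
The 200-example size also bounds how precisely a per-config accuracy can be estimated. As a rough illustration, the sketch below computes a Wilson score interval for an accuracy measured on n pairs, assuming each pair is an independent trial; that independence is an assumption and is not stated in the card. The function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("need n > 0 and 0 <= correct <= n")
    p = correct / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

For a model at chance level on one config (100 of 200 pairs correct), the interval spans roughly 0.43 to 0.57, so differences of a few percentage points between models on a single config are within sampling noise.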

Common scenarios

Classic use case

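SALMon_Spirit-LM-Expressive-normalized plays an important role in speech language model research. Its classic use is evaluating and improving a model's control over the consistency of speech expression. The dataset defines sub-task configs including background alignment, sentiment consistency, speaker consistency, and gender consistency, each containing positive/negative sample pairs with the corresponding audio and token-level losses. Researchers can therefore test systematically whether a model, when generating continued speech, stays consistent with the prompt audio in timbre, sentiment, background noise, and other dimensions, which makes the dataset a core benchmark for measuring and improving the expressive robustness of speech language models.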

Derived and related work

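The dataset has given rise to a line of academic work, and its main influence lies in prompting research focused specifically on expressive consistency in speech language models. Around the consistency tasks it defines, researchers have proposed improved models, for example by adding explicit speaker-embedding conditioning or by designing contrastive-learning losses, to strengthen the preservation of timbre and sentiment features over long generations. The dataset is also used as an evaluation benchmark and is widely cited in work on discrete audio representations and controllable speech generation. Together, these studies have helped move speech generation from "can speak" toward "speaks well and speaks consistently", and have deepened the field's understanding of the internal coherence of speech expression.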

Recent research on the dataset