SpeechPPL/SALMon_Spirit-LM-Base-normalized

Name: SpeechPPL/SALMon_Spirit-LM-Base-normalized
Creator: SpeechPPL
Published: 2026-04-10 14:17:21
License: not specified

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/SpeechPPL/SALMon_Spirit-LM-Base-normalized

Resource description:

---
configs:
- config_name: bg_all_consistency
  data_files:
  - split: train
    path: bg_all_consistency/train-*
- config_name: bg_domain_consistency
  data_files:
  - split: train
    path: bg_domain_consistency/train-*
- config_name: gender_consistency
  data_files:
  - split: train
    path: gender_consistency/train-*
- config_name: rir_consistency
  data_files:
  - split: train
    path: rir_consistency/train-*
- config_name: sentiment_consistency
  data_files:
  - split: train
    path: sentiment_consistency/train-*
- config_name: speaker_consistency
  data_files:
  - split: train
    path: speaker_consistency/train-*
dataset_info:
- config_name: bg_all_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  splits:
  - name: train
    num_bytes: 245778803
    num_examples: 200
  download_size: 245778803
  dataset_size: 245778803
- config_name: bg_domain_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  splits:
  - name: train
    num_bytes: 248930087
    num_examples: 200
  download_size: 248930087
  dataset_size: 248930087
- config_name: gender_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  splits:
  - name: train
    num_bytes: 249311097
    num_examples: 200
  download_size: 249311097
  dataset_size: 249311097
- config_name: rir_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  splits:
  - name: train
    num_bytes: 231672093
    num_examples: 200
  download_size: 231672093
  dataset_size: 231672093
- config_name: sentiment_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  splits:
  - name: train
    num_bytes: 247125104
    num_examples: 200
  download_size: 247125104
  dataset_size: 247125104
- config_name: speaker_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    list:
    - name: hubert
      dtype: string
  - name: positive_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    list:
    - name: hubert
      dtype: string
  splits:
  - name: train
    num_bytes: 249761209
    num_examples: 200
  download_size: 249761209
  dataset_size: 249761209
---

# SALMon Normalized Dataset

This repo preserves the SALMon per-config folder layout while normalizing mismatched schema details across model families.

Provider:

SpeechPPL

Collected Summary

Dataset Introduction

Construction

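SALMon_Spirit-LM-Base-normalized keeps the per-config folder layout of the original SALMon dataset and normalizes the schema differences that exist across model families. According to the dataset card metadata, it contains six sub-configurations: bg_all_consistency (background, all), bg_domain_consistency (background domain), gender_consistency, rir_consistency (room impulse response), sentiment_consistency, and speaker_consistency. Each configuration has a single train split of 200 examples. Each example contains positive and negative audio, prompt audio, and continuation audio, together with the corresponding HuBERT discrete unit sequences, token-wise losses, and a perplexity sanity field (ppl_sanity), which gives structured support for fine-grained evaluation of audio language models.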
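As a small illustration of how the token-wise loss fields relate to perplexity, the sketch below assumes that each loss value is a per-token negative log-likelihood in nats. The dataset card does not state the unit, so this is an assumption, and the helper name `perplexity` is hypothetical.

```python
import math


def perplexity(tokenwise_loss):
    """Perplexity of one sequence, computed as exp(mean token-wise loss).

    Assumes each entry is a negative log-likelihood in nats (an assumption,
    not something the dataset card confirms).
    """
    if not tokenwise_loss:
        raise ValueError("tokenwise_loss must contain at least one value")
    return math.exp(sum(tokenwise_loss) / len(tokenwise_loss))
```

Under that assumption, passing a row's `positive_sample_tokenwise_loss` and `negative_sample_tokenwise_loss` to this helper gives two perplexities that can be compared directly; a lower value means the model found that sample more likely.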

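Usage

The dataset can be loaded with the Hugging Face Datasets library. Choose the sub-configuration that matches the research goal, for example 'bg_all_consistency' to evaluate background consistency or 'speaker_consistency' to test speaker consistency. Each loaded example provides positive and negative audio with the corresponding HuBERT discrete units, which can be used to compute the model's continuation loss given a prompt or to compare how the model scores the positive and negative samples. The model-generated continuation audio and the token-wise losses also support deeper analysis, such as studying a model's error patterns across acoustic attributes or running attribution studies.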
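A minimal sketch of this workflow, assuming the `datasets` package is installed and the Hub is reachable. The helper names `load_salmon_config` and `pairwise_accuracy` are hypothetical, and scoring by mean token-wise loss (lower is better) is an assumption about how the loss columns are meant to be read, not a documented metric of this repo.

```python
from statistics import mean

REPO_ID = "SpeechPPL/SALMon_Spirit-LM-Base-normalized"
LOSS_COLUMNS = [
    "positive_sample_tokenwise_loss",
    "negative_sample_tokenwise_loss",
]


def load_salmon_config(config_name="speaker_consistency"):
    """Load the train split of one sub-configuration, keeping only loss columns.

    Dropping the audio columns avoids decoding audio while iterating.
    """
    # Imported lazily so the scoring helper below works without `datasets`.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, config_name, split="train")
    return ds.select_columns(LOSS_COLUMNS)


def pairwise_accuracy(rows):
    """Fraction of rows whose positive sample has the lower mean loss.

    Assumption: the loss is a per-token negative log-likelihood, so a lower
    mean means the model prefers that sample. Rows with an empty loss list
    are skipped.
    """
    wins = 0
    total = 0
    for row in rows:
        pos = row["positive_sample_tokenwise_loss"]
        neg = row["negative_sample_tokenwise_loss"]
        if not pos or not neg:
            continue
        total += 1
        if mean(pos) < mean(neg):
            wins += 1
    return wins / total if total else float("nan")
```

A call such as `pairwise_accuracy(load_salmon_config("rir_consistency"))` would then give one score per sub-configuration; the mean is used here rather than the sum so that samples of different token lengths stay comparable.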

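Background and Challenges

Background Overview

With neural audio codecs and generative speech models developing rapidly, a key open problem in speech generation is how to evaluate and improve a model's ability to perceive and preserve acoustic attributes such as background noise, domain consistency, speaker characteristics, emotional prosody, and reverberation. SALMon_Spirit-LM-Base-normalized, published by SpeechPPL on 2026-04-10, aims to provide a standardized benchmark for multi-dimensional audio consistency evaluation. It is based on the Spirit-LM-Base model and uses consistency tasks (background, background domain, gender, room impulse response, sentiment, and speaker consistency) to examine whether the model, when continuing from prompt audio, maintains or changes a specific acoustic dimension. Its core research question is what mechanisms and limitations large speech language models show when handling complex acoustic changes. By recording token-wise losses, a perplexity sanity field, and raw HuBERT units for the positive and negative samples in each configuration, the dataset offers several layers of information for analyzing model behavior and can support work on controllable speech synthesis that preserves perceptual consistency.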

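Current Challenges

The domain problem this dataset addresses is that existing speech generation models often handle semantic content well but struggle, over long generations, to preserve or deliberately control non-semantic acoustic attributes such as the continuity of the background environment, the stability of speaker identity, and the accurate delivery of emotional tone. Construction raises three challenges. First, each consistency task needs precisely built positive and negative audio pairs in which only the target acoustic dimension changes while all other dimensions stay aligned, which places high demands on the audio editing and annotation pipeline. Second, normalization requires matching and unifying schemas across model architectures; the metadata structures produced by different model families differ considerably, and standardizing them one by one is technically complex. Third, each configuration contains only 200 training examples, so capturing the diversity of model behavior at this scale, while keeping positive and negative examples comparable on fine-grained metrics such as token-wise loss, is a demanding test for data selection and quality validation.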

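Common Scenarios

Typical Use Case

In research on speech generation and consistency evaluation, SALMon_Spirit-LM-Base-normalized provides a benchmark for assessing and improving how consistently neural speech language models such as Spirit-LM handle multiple acoustic attributes. The dataset card lists six sub-tasks: background (all) consistency, background domain consistency, gender consistency, room impulse response consistency, sentiment consistency, and speaker consistency. Each sub-task pairs positive and negative samples with a model-generated continuation, token-wise losses, and raw HuBERT units, to examine how well the model preserves the acoustic environment and speaker characteristics of the original context when continuing speech. Researchers can use the dataset for fine-grained preference evaluation and comparative analysis. The typical experimental setup is to take the same prompt audio, compare the probabilities the model assigns to the positive and negative continuations, and use the difference to quantify how closely the model follows a given consistency attribute.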
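The comparison described above can be sketched as a log-likelihood margin between the two continuations. This assumes that `positive_continuation_tokenwise_loss` and `negative_continuation_tokenwise_loss` hold per-token negative log-likelihoods conditioned on the same prompt, which the dataset card does not confirm; `continuation_margin` and `summarize_margins` are hypothetical helper names.

```python
def continuation_margin(row):
    """Return log p(positive continuation) - log p(negative continuation).

    Assumption: each loss entry is a negative log-likelihood, so the summed
    loss is the negative sequence log-probability.
    """
    pos_nll = sum(row["positive_continuation_tokenwise_loss"])
    neg_nll = sum(row["negative_continuation_tokenwise_loss"])
    return neg_nll - pos_nll


def summarize_margins(rows):
    """Aggregate margins into a preference rate and a mean margin."""
    margins = [continuation_margin(row) for row in rows]
    n = len(margins)
    if n == 0:
        return {"n": 0, "preference_rate": float("nan"), "mean_margin": float("nan")}
    return {
        "n": n,
        # Share of rows where the positive continuation is more likely.
        "preference_rate": sum(m > 0 for m in margins) / n,
        "mean_margin": sum(margins) / n,
    }
```

Summed losses favor shorter continuations, so when the positive and negative continuations differ in token count, dividing each sum by its length before comparing is a reasonable alternative.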

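Academic Problems Addressed

The main academic value of the dataset is that it addresses a long-standing difficulty in evaluating coherence and consistency in speech language models. Conventional perplexity or subjective listening tests do not readily show whether a model, when continuing speech, actually preserves higher-level information such as background, emotional tone, and acoustic environment. By building a structured contrast between positive and negative samples, SALMon_Spirit-LM-Base-normalized lets researchers diagnose autoregressive speech models separately along dimensions such as background consistency, sentiment consistency, and speaker preservation. Its token-wise loss comparison and HuBERT unit retention analysis give quantitative tools for understanding a model's internal representations and generation biases. The dataset supports a shift in speech generation research from focusing only on acoustic fidelity toward also valuing coherence and adaptation to the acoustic environment, and it provides an evaluation basis for building speech interaction systems that are more robust and better aligned with human perception.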

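Practical Applications

In practice, the dataset serves quality assurance and iterative improvement for products such as voice assistants, multimodal dialogue systems, and audio content generation. In emotion-aware voice interaction, for example, developers can use the positive and negative samples of the sentiment consistency task to train or fine-tune a model so that its speech continuation stays emotionally consistent with the user's initial input and avoids abrupt shifts in emotion. In multi-room voice control for smart homes, the room impulse response consistency evaluation helps check that the spatial acoustic characteristics of the model's output blend with the real environment. The dataset can also be used to detect and correct bias in speech generation; the gender consistency task, for instance, helps prevent a model from unintentionally changing the speaker's gender identity when continuing a dialogue. These uses make the SALMon dataset a useful validation tool on the path from laboratory prototype to reliable commercial deployment.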

Recent Research on the Dataset