SpeechPPL/SALMon_TASLM-normalized
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_TASLM-normalized
Resource description:
---
configs:
- config_name: bg_alignment
  data_files:
  - split: train
    path: bg_alignment/train-*
- config_name: bg_all_consistency
  data_files:
  - split: train
    path: bg_all_consistency/train-*
- config_name: bg_domain_consistency
  data_files:
  - split: train
    path: bg_domain_consistency/train-*
- config_name: gender_consistency
  data_files:
  - split: train
    path: gender_consistency/train-*
- config_name: rir_consistency
  data_files:
  - split: train
    path: rir_consistency/train-*
- config_name: sentiment_alignment
  data_files:
  - split: train
    path: sentiment_alignment/train-*
- config_name: sentiment_consistency
  data_files:
  - split: train
    path: sentiment_consistency/train-*
- config_name: speaker_consistency
  data_files:
  - split: train
    path: speaker_consistency/train-*
dataset_info:
- config_name: bg_alignment
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_asr_text
dtype: string
- name: positive_spk_embed
sequence: float32
- name: negative_asr_text
dtype: string
- name: negative_spk_embed
sequence: float32
- name: positive_sample_wordlevel_loss
sequence: float32
- name: negative_sample_wordlevel_loss
sequence: float32
- name: code_frame_rate
dtype: string
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: positive_asr_text_old
dtype: string
- name: negative_asr_text_old
dtype: string
- name: negative_sample_wordlevel_loss_old
sequence: float64
- name: positive_sample_wordlevel_loss_old
sequence: float64
- name: positive_asr_chunks
list:
- name: text
dtype: string
- name: timestamp
sequence: float64
- name: ppl_sanity_aligned
dtype: int64
splits:
- name: train
num_bytes: 87024318
num_examples: 200
download_size: 87024318
dataset_size: 87024318
- config_name: bg_all_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_asr_text
dtype: string
- name: positive_spk_embed
sequence: float32
- name: negative_asr_text
dtype: string
- name: negative_spk_embed
sequence: float32
- name: prompt_asr_text
dtype: string
- name: prompt_spk_embed
sequence: float32
- name: positive_sample_wordlevel_loss
sequence: float32
- name: negative_sample_wordlevel_loss
sequence: float32
- name: prompt_sample_wordlevel_loss
sequence: float32
- name: code_frame_rate
dtype: string
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 22050
- name: positive_asr_text_old
dtype: string
- name: negative_asr_text_old
dtype: string
- name: negative_sample_wordlevel_loss_old
sequence: float64
- name: positive_sample_wordlevel_loss_old
sequence: float64
- name: positive_asr_chunks
list:
- name: text
dtype: string
- name: timestamp
sequence: float64
- name: prompt_asr_text_old
dtype: string
- name: prompt_sample_wordlevel_loss_old
sequence: float64
- name: positive_continuation_wordlevel_loss
sequence: float32
- name: negative_continuation_wordlevel_loss
sequence: float32
- name: continuation_asr_text
dtype: string
- name: ppl_sanity_aligned
dtype: int64
splits:
- name: train
num_bytes: 270240761
num_examples: 200
download_size: 270240761
dataset_size: 270240761
- config_name: bg_domain_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_asr_text
dtype: string
- name: positive_spk_embed
sequence: float32
- name: negative_asr_text
dtype: string
- name: negative_spk_embed
sequence: float32
- name: prompt_asr_text
dtype: string
- name: prompt_spk_embed
sequence: float32
- name: positive_sample_wordlevel_loss
sequence: float32
- name: negative_sample_wordlevel_loss
sequence: float32
- name: prompt_sample_wordlevel_loss
sequence: float32
- name: code_frame_rate
dtype: string
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 22050
- name: positive_asr_text_old
dtype: string
- name: negative_asr_text_old
dtype: string
- name: negative_sample_wordlevel_loss_old
sequence: float64
- name: positive_sample_wordlevel_loss_old
sequence: float64
- name: positive_asr_chunks
list:
- name: text
dtype: string
- name: timestamp
sequence: float64
- name: prompt_asr_text_old
dtype: string
- name: prompt_sample_wordlevel_loss_old
sequence: float64
- name: positive_continuation_wordlevel_loss
sequence: float32
- name: negative_continuation_wordlevel_loss
sequence: float32
- name: continuation_asr_text
dtype: string
- name: ppl_sanity_aligned
dtype: int64
splits:
- name: train
num_bytes: 270011718
num_examples: 200
download_size: 270011718
dataset_size: 270011718
- config_name: gender_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_asr_text
dtype: string
- name: positive_spk_embed
sequence: float32
- name: negative_asr_text
dtype: string
- name: negative_spk_embed
sequence: float32
- name: prompt_asr_text
dtype: string
- name: prompt_spk_embed
sequence: float32
- name: positive_sample_wordlevel_loss
sequence: float32
- name: negative_sample_wordlevel_loss
sequence: float32
- name: prompt_sample_wordlevel_loss
sequence: float32
- name: code_frame_rate
dtype: string
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 22050
- name: positive_asr_text_old
dtype: string
- name: negative_asr_text_old
dtype: string
- name: negative_sample_wordlevel_loss_old
sequence: float64
- name: positive_sample_wordlevel_loss_old
sequence: float64
- name: positive_asr_chunks
list:
- name: text
dtype: string
- name: timestamp
sequence: float64
- name: prompt_asr_text_old
dtype: string
- name: prompt_sample_wordlevel_loss_old
sequence: float64
- name: positive_continuation_wordlevel_loss
sequence: float32
- name: negative_continuation_wordlevel_loss
sequence: float32
- name: continuation_asr_text
dtype: string
- name: ppl_sanity_aligned
dtype: int64
splits:
- name: train
num_bytes: 278743927
num_examples: 200
download_size: 278743927
dataset_size: 278743927
- config_name: rir_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_asr_text
dtype: string
- name: positive_spk_embed
sequence: float32
- name: negative_asr_text
dtype: string
- name: negative_spk_embed
sequence: float32
- name: prompt_asr_text
dtype: string
- name: prompt_spk_embed
sequence: float32
- name: positive_sample_wordlevel_loss
sequence: float32
- name: negative_sample_wordlevel_loss
sequence: float32
- name: prompt_sample_wordlevel_loss
sequence: float32
- name: code_frame_rate
dtype: string
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 22050
- name: positive_asr_text_old
dtype: string
- name: negative_asr_text_old
dtype: string
- name: negative_sample_wordlevel_loss_old
sequence: float64
- name: positive_sample_wordlevel_loss_old
sequence: float64
- name: positive_asr_chunks
list:
- name: text
dtype: string
- name: timestamp
sequence: float64
- name: prompt_asr_text_old
dtype: string
- name: prompt_sample_wordlevel_loss_old
sequence: float64
- name: positive_continuation_wordlevel_loss
sequence: float32
- name: negative_continuation_wordlevel_loss
sequence: float32
- name: continuation_asr_text
dtype: string
- name: ppl_sanity_aligned
dtype: int64
splits:
- name: train
num_bytes: 259091116
num_examples: 200
download_size: 259091116
dataset_size: 259091116
- config_name: sentiment_alignment
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_asr_text
dtype: string
- name: positive_spk_embed
sequence: float32
- name: negative_asr_text
dtype: string
- name: negative_spk_embed
sequence: float32
- name: positive_sample_wordlevel_loss
sequence: float32
- name: negative_sample_wordlevel_loss
sequence: float32
- name: code_frame_rate
dtype: string
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: positive_asr_text_old
dtype: string
- name: negative_asr_text_old
dtype: string
- name: negative_sample_wordlevel_loss_old
sequence: float64
- name: positive_sample_wordlevel_loss_old
sequence: float64
- name: positive_asr_chunks
list:
- name: text
dtype: string
- name: timestamp
sequence: float64
- name: ppl_sanity_aligned
dtype: int64
splits:
- name: train
num_bytes: 46920619
num_examples: 200
download_size: 46920619
dataset_size: 46920619
- config_name: sentiment_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_asr_text
dtype: string
- name: positive_spk_embed
sequence: float32
- name: negative_asr_text
dtype: string
- name: negative_spk_embed
sequence: float32
- name: prompt_asr_text
dtype: string
- name: prompt_spk_embed
sequence: float32
- name: positive_sample_wordlevel_loss
sequence: float32
- name: negative_sample_wordlevel_loss
sequence: float32
- name: prompt_sample_wordlevel_loss
sequence: float32
- name: code_frame_rate
dtype: string
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 22050
- name: positive_asr_text_old
dtype: string
- name: negative_asr_text_old
dtype: string
- name: negative_sample_wordlevel_loss_old
sequence: float64
- name: positive_sample_wordlevel_loss_old
sequence: float64
- name: positive_asr_chunks
list:
- name: text
dtype: string
- name: timestamp
sequence: float64
- name: prompt_asr_text_old
dtype: string
- name: prompt_sample_wordlevel_loss_old
sequence: float64
- name: positive_continuation_wordlevel_loss
sequence: float32
- name: negative_continuation_wordlevel_loss
sequence: float32
- name: continuation_asr_text
dtype: string
- name: ppl_sanity_aligned
dtype: int64
splits:
- name: train
num_bytes: 272654768
num_examples: 200
download_size: 272654768
dataset_size: 272654768
- config_name: speaker_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_asr_text
dtype: string
- name: positive_spk_embed
sequence: float32
- name: negative_asr_text
dtype: string
- name: negative_spk_embed
sequence: float32
- name: prompt_asr_text
dtype: string
- name: prompt_spk_embed
sequence: float32
- name: positive_sample_wordlevel_loss
sequence: float32
- name: negative_sample_wordlevel_loss
sequence: float32
- name: prompt_sample_wordlevel_loss
sequence: float32
- name: code_frame_rate
dtype: string
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 22050
- name: positive_asr_text_old
dtype: string
- name: negative_asr_text_old
dtype: string
- name: negative_sample_wordlevel_loss_old
sequence: float64
- name: positive_sample_wordlevel_loss_old
sequence: float64
- name: positive_asr_chunks
list:
- name: text
dtype: string
- name: timestamp
sequence: float64
- name: prompt_asr_text_old
dtype: string
- name: prompt_sample_wordlevel_loss_old
sequence: float64
- name: positive_continuation_wordlevel_loss
sequence: float32
- name: negative_continuation_wordlevel_loss
sequence: float32
- name: continuation_asr_text
dtype: string
- name: ppl_sanity_aligned
dtype: int64
splits:
- name: train
num_bytes: 283506280
num_examples: 200
download_size: 283506280
dataset_size: 283506280
---
# SALMon Normalized Dataset
This repo preserves the SALMon per-config folder layout while normalizing
mismatched schema details across model families.
Provider:
SpeechPPL
Collected summary
Dataset introduction

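Construction
The SALMon_TASLM-normalized dataset was built for speech generation and evaluation, and is intended as a standardized data resource for fine-grained evaluation of speech language models. It consists of several sub-configurations, each corresponding to one evaluation task, such as bg_alignment, speaker_consistency, and sentiment_alignment. Each sub-configuration contains paired positive and negative audio samples together with metadata such as transcripts, speaker embeddings, and word-level losses. By unifying schema differences between model families, the dataset normalizes and aligns field types while preserving the per-config folder layout of the original SALMon.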
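As a reading aid for the schema, the following sketch records which fields the `dataset_info` block of the card lists for every config and which it lists only for the six `*_consistency` configs, then checks an example dict against that list. The field names are copied from the card; the grouping into two families and the helper names `expected_fields` and `missing_fields` are illustrative assumptions, not part of any official tooling.

```python
# Config families as they appear in the card's `configs` list.
ALIGNMENT_CONFIGS = {"bg_alignment", "sentiment_alignment"}
CONSISTENCY_CONFIGS = {
    "bg_all_consistency", "bg_domain_consistency", "gender_consistency",
    "rir_consistency", "sentiment_consistency", "speaker_consistency",
}

# Fields the card lists for all eight configs.
SHARED_FIELDS = {
    "task", "ind", "positive_audio", "negative_audio", "prompt_audio",
    "continuation_audio_positive", "continuation_audio_negative",
    "negative_audio_sanity", "positive_asr_text", "positive_spk_embed",
    "negative_asr_text", "negative_spk_embed",
    "positive_sample_wordlevel_loss", "negative_sample_wordlevel_loss",
    "code_frame_rate", "code_depth", "model_sampling_rate", "ppl_sanity",
    "positive_asr_text_old", "negative_asr_text_old",
    "negative_sample_wordlevel_loss_old", "positive_sample_wordlevel_loss_old",
    "positive_asr_chunks", "ppl_sanity_aligned",
}

# Fields the card lists only for the *_consistency configs.
CONSISTENCY_ONLY_FIELDS = {
    "audio_transition_s", "prompt_asr_text", "prompt_spk_embed",
    "prompt_sample_wordlevel_loss", "model_generated_continuation",
    "prompt_asr_text_old", "prompt_sample_wordlevel_loss_old",
    "positive_continuation_wordlevel_loss",
    "negative_continuation_wordlevel_loss", "continuation_asr_text",
}


def expected_fields(config_name):
    """Return the set of field names the card lists for one config."""
    if config_name in ALIGNMENT_CONFIGS:
        return set(SHARED_FIELDS)
    if config_name in CONSISTENCY_CONFIGS:
        return SHARED_FIELDS | CONSISTENCY_ONLY_FIELDS
    raise ValueError(f"unknown config: {config_name!r}")


def missing_fields(example, config_name):
    """List the expected fields that are absent from an example dict (hypothetical helper)."""
    return sorted(expected_fields(config_name) - set(example))
```

A check like this can be run on the first row of each config after loading, to confirm that downstream code does not rely on a prompt-level field when it is given an `*_alignment` config.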
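Features
A notable feature of the dataset is its hierarchical task design and multi-dimensional sample metadata. Each sample contains not only the positive and negative audio but also prompt audio, continuation audio, ASR text, speaker embeddings, and word-level loss sequences, which supports evaluation of speech consistency, alignment, and robustness. The sub-configurations focus on specific dimensions such as background consistency, gender consistency, room impulse response (RIR) consistency, and sentiment alignment and consistency. Each configuration contains 200 examples in a single train split. The prompt, continuation, and negative_audio_sanity fields are declared at 16 kHz and model_generated_continuation at 22.05 kHz, while positive_audio and negative_audio declare no fixed sampling rate.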
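Usage
The dataset can be loaded with the Hugging Face Datasets library, selecting a task by sub-configuration name (for example 'bg_alignment' or 'speaker_consistency'). Each loaded sample carries many fields: depending on the evaluation goal, researchers can compare the positive and negative audio pair, or use the prompt and continuation audio to evaluate conditional generation. Metadata such as word-level losses, ASR text, and speaker embeddings can be used to compute metrics such as word error rate, speaker similarity, or generation perplexity. The dataset suits benchmarking of speech language models, cross-configuration consistency analysis, and fine-grained error diagnosis.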
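A minimal loading sketch follows. It assumes the `datasets` package is installed and that the repository ID matches the page title, `SpeechPPL/SALMon_TASLM-normalized`; the wrapper name `load_salmon_config` is hypothetical. The config names and the single `train` split are taken from the card.

```python
REPO_ID = "SpeechPPL/SALMon_TASLM-normalized"  # assumed from the page title

# Config names as listed in the card.
CONFIG_NAMES = (
    "bg_alignment",
    "bg_all_consistency",
    "bg_domain_consistency",
    "gender_consistency",
    "rir_consistency",
    "sentiment_alignment",
    "sentiment_consistency",
    "speaker_consistency",
)


def load_salmon_config(config_name, split="train"):
    """Load one sub-configuration; every config in the card has only a train split."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config {config_name!r}; expected one of {CONFIG_NAMES}")
    # Imported lazily so the name check above works without the library installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split=split)
```

Calling `load_salmon_config("speaker_consistency")` downloads the data for that config. According to the card, each split has 200 rows, and text fields such as `task` and `positive_asr_text` can be read directly from a row.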
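Background and challenges
Background overview
SALMon_TASLM-normalized emerged as speech generation and evaluation were evolving quickly, and was built by a research team to systematically evaluate the perceptual quality of speech language models (TASLM). Its design targets a core research problem: speech synthesis evaluation lacks fine-grained, multi-dimensional standardized benchmarks. Its eight configs cover background alignment, background consistency (all and domain), gender consistency, room impulse response consistency, sentiment alignment, sentiment consistency, and speaker consistency, so that a model's fidelity and robustness can be evaluated separately along dimensions such as acoustic environment, speaker identity, and emotional expression. The release gives speech generation models a tool for side-by-side comparison and systematic improvement, and supports a shift in evaluation methodology from single metrics toward multi-dimensional, conditional evaluation.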
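Current challenges
The domain challenge the dataset addresses is that existing evaluation of speech language models relies heavily on averaged global metrics such as MOS, which cannot reveal weaknesses under specific conditions (background noise, speaker changes, emotion shifts) and so hinders robustness improvements in complex acoustic scenes. The core construction challenges include: 1) designing positive and negative sample pairs that isolate the single dimension under test while holding other factors fixed, which requires very careful selection of audio sources and annotation of transition times (TRANSITION); 2) schema mismatch across model families, which this dataset resolves through normalization, although keeping the data structure and semantics of each field consistent across configs (audio at different sampling rates, word-level loss sequences, speaker embeddings, and so on) remains a demanding data engineering task.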
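One way to turn the positive/negative pairs into a per-config score is a pairwise preference accuracy: count how often the stored loss is lower for the positive sample than for the negative one. The sketch below assumes that the `positive_sample_wordlevel_loss` and `negative_sample_wordlevel_loss` fields hold per-word losses where lower means more likely, and that comparing their means is a fair reduction. Neither assumption is stated in the card, and the function names are illustrative.

```python
from statistics import fmean


def mean_loss(word_losses):
    """Average a sequence of per-word losses (assumed: lower means more likely)."""
    if not word_losses:
        raise ValueError("empty loss sequence")
    return fmean(word_losses)


def prefers_positive(example):
    """True when the positive sample has the lower mean word-level loss."""
    positive = mean_loss(example["positive_sample_wordlevel_loss"])
    negative = mean_loss(example["negative_sample_wordlevel_loss"])
    return positive < negative


def pairwise_accuracy(examples):
    """Fraction of examples in which the positive sample is preferred."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples given")
    return sum(prefers_positive(e) for e in examples) / len(examples)


# Toy rows standing in for dataset examples (values are made up).
toy_rows = [
    {"positive_sample_wordlevel_loss": [1.0, 2.0],
     "negative_sample_wordlevel_loss": [2.0, 3.0]},
    {"positive_sample_wordlevel_loss": [3.0],
     "negative_sample_wordlevel_loss": [1.0]},
]
print(pairwise_accuracy(toy_rows))  # 0.5
```

Because every config has the same number of examples, reporting this value per config keeps conditions such as background, speaker, and sentiment separate instead of averaging them into one global figure.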
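Common scenarios
Typical use cases
In speech generation and understanding research, SALMon_TASLM-normalized, with its finely divided sub-configurations (such as bg_alignment and sentiment_consistency), serves as a benchmark resource for evaluating and improving the consistency and alignment abilities of speech language models across multiple dimensions. It contains paired positive and negative audio samples plus auxiliary information (speaker embeddings, word-level losses, ASR text, and so on), which can be used to train models to stay internally consistent in background, domain, gender, sentiment, speaker, and reverberation. A typical use is a contrastive setup over positive and negative samples that guides a model to capture and maintain the target acoustic attributes in speech continuation, improving the naturalness and controllability of generated speech.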
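As an illustration of how the speaker embeddings might be used, the sketch below computes the cosine similarity between the prompt embedding and each of the positive and negative embeddings, and reports the gap. It assumes the three `*_spk_embed` vectors of an example have equal length and come from the same embedding model. The card does not document which audio each embedding was computed from, and `prompt_spk_embed` is listed only for the `*_consistency` configs, so treat this as a hypothetical analysis.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm vector")
    return dot / (norm_a * norm_b)


def speaker_similarity_gap(example):
    """Similarity of prompt to positive minus similarity of prompt to negative."""
    prompt = example["prompt_spk_embed"]
    to_positive = cosine_similarity(prompt, example["positive_spk_embed"])
    to_negative = cosine_similarity(prompt, example["negative_spk_embed"])
    return to_positive - to_negative
```

On a config such as `speaker_consistency`, a positive gap would mean the positive sample sits closer to the prompt's speaker than the negative one does, under the assumptions above.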
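Academic problems addressed
The dataset responds to an open problem in speech generation: how to systematically quantify and improve the semantic consistency and alignment accuracy of speech language models across multiple acoustic attributes. Traditional evaluation focuses on matching speech content and surface acoustic features, whereas SALMon_TASLM-normalized introduces multi-dimensional consistency comparison tasks (background consistency, sentiment consistency, speaker consistency, and others), giving researchers a strict benchmark for whether a model actually understands and preserves the target acoustic context. This design moves evaluation from a single accuracy figure toward multi-modal semantic alignment. It provides a data foundation for more robust and controllable generative speech models and encourages theoretical integration of speech understanding and generation.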
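Derived related work
Several lines of research have built on SALMon_TASLM-normalized. Its consistency evaluation scheme has been applied to comparative analysis of text-to-speech (TTS) and speech-to-speech (STS) translation systems, and has prompted techniques such as contrastive-learning-based fine-tuning of speech representations and multi-task consistency regularization. Some work extends the dimensions defined by the dataset, using sub-tasks such as speaker consistency and sentiment consistency as core evaluation metrics and validating them on zero-shot voice cloning and cross-lingual emotion transfer. As a normalized version, the dataset also gives later research a standardized data interface, which lowers the cost of comparing model architectures and helps standardize the evaluation of speech generation models.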
The above content was collected and summarized by 遇见数据集.



