SpeechPPL/SALMon_pGSLM-normalized
Source: Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_pGSLM-normalized
Resource description:
---
configs:
- config_name: bg_alignment
  data_files:
  - split: train
    path: bg_alignment/train-*
- config_name: bg_all_consistency
  data_files:
  - split: train
    path: bg_all_consistency/train-*
- config_name: bg_domain_consistency
  data_files:
  - split: train
    path: bg_domain_consistency/train-*
- config_name: gender_consistency
  data_files:
  - split: train
    path: gender_consistency/train-*
- config_name: rir_consistency
  data_files:
  - split: train
    path: rir_consistency/train-*
- config_name: sentiment_alignment
  data_files:
  - split: train
    path: sentiment_alignment/train-*
- config_name: sentiment_consistency
  data_files:
  - split: train
    path: sentiment_consistency/train-*
- config_name: speaker_consistency
  data_files:
  - split: train
    path: speaker_consistency/train-*
dataset_info:
- config_name: bg_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    dtype: 'null'
  - name: negative_continuation_tokenwise_loss
    dtype: 'null'
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 87019899
    num_examples: 200
  download_size: 87019899
  dataset_size: 87019899
- config_name: bg_all_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 207143109
    num_examples: 200
  download_size: 207143109
  dataset_size: 207143109
- config_name: bg_domain_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 210303767
    num_examples: 200
  download_size: 210303767
  dataset_size: 210303767
- config_name: gender_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 209062383
    num_examples: 200
  download_size: 209062383
  dataset_size: 209062383
- config_name: rir_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 202129101
    num_examples: 200
  download_size: 202129101
  dataset_size: 202129101
- config_name: sentiment_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    dtype: 'null'
  - name: negative_continuation_tokenwise_loss
    dtype: 'null'
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 46742156
    num_examples: 200
  download_size: 46742156
  dataset_size: 46742156
- config_name: sentiment_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 210124373
    num_examples: 200
  download_size: 210124373
  dataset_size: 210124373
- config_name: speaker_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 209864717
    num_examples: 200
  download_size: 209864717
  dataset_size: 209864717
---
# SALMon Normalized Dataset
This repo preserves the SALMon per-config folder layout while normalizing
mismatched schema details across model families.
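Provider:
SpeechPPL
Collected summary
Dataset introduction

Construction
As speech generation models evolve rapidly, evaluating their outputs systematically and along multiple dimensions has become a key challenge. SALMon_pGSLM-normalized was built to meet this need: based on the SALMon framework, it post-processes and normalizes pGSLM outputs from different model families. The construction preserves the original per-config folder layout while aligning the schema inconsistencies that exist across models. It contains eight configs (bg_alignment, bg_all_consistency, bg_domain_consistency, gender_consistency, rir_consistency, sentiment_alignment, sentiment_consistency, and speaker_consistency), each with 200 training examples. Each example includes the positive and negative sample audio, the model-generated continuation audio, HuBERT-based discrete units, acoustic features such as F0 and duration, and token-wise loss values, giving structured material for analyzing how a model behaves along a specific dimension.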
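As a quick sanity check on the layout described above, the sketch below tallies the split metadata listed in the card's `dataset_info` block. The byte counts are copied from that block; the helper name `summarize` is only illustrative and is not part of the dataset or any library.

```python
# Split sizes (num_bytes) copied from the card's `dataset_info` block.
CONFIG_NUM_BYTES = {
    "bg_alignment": 87019899,
    "bg_all_consistency": 207143109,
    "bg_domain_consistency": 210303767,
    "gender_consistency": 209062383,
    "rir_consistency": 202129101,
    "sentiment_alignment": 46742156,
    "sentiment_consistency": 210124373,
    "speaker_consistency": 209864717,
}
EXAMPLES_PER_CONFIG = 200  # `num_examples` of every train split in the card


def summarize(num_bytes=CONFIG_NUM_BYTES, per_config=EXAMPLES_PER_CONFIG):
    """Return (number of configs, total examples, total bytes)."""
    return len(num_bytes), len(num_bytes) * per_config, sum(num_bytes.values())


n_configs, total_examples, total_bytes = summarize()
smallest = min(CONFIG_NUM_BYTES, key=CONFIG_NUM_BYTES.get)
print(f"{n_configs} configs, {total_examples} examples, {total_bytes / 1e9:.2f} GB")
print(f"smallest config: {smallest}")
```

The two alignment configs are noticeably smaller than the consistency configs, which matters mainly when deciding which subsets to download first.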
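Features
The dataset's most notable characteristics are its structured layout and its multi-dimensional evaluation coverage. First, its eight configs systematically evaluate speech generation models on background alignment (bg_alignment), cross-domain and gender consistency, speaker stability, sentiment alignment and consistency, and robustness to reverberant environments (rir_consistency). Second, every example carries rich semantic and acoustic metadata: the positive and negative sample audio and their discretized representations over HuBERT, F0, and duration, plus the model-generated continuation audio and the corresponding token-wise loss vectors. This fine-grained information lets researchers examine model behavior at each time step of generation. In addition, all audio uses a uniform 16 kHz sampling rate, which keeps the data format standardized and reproducible.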
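One way to use the token-wise loss vectors for per-step analysis is to find where the positive and negative losses first part ways. The sketch below is a minimal illustration: it assumes the two loss lists are index-aligned over a shared prefix, which the card does not confirm, and `first_divergence`, `suffix_mean_gap`, and the tolerance value are hypothetical choices.

```python
from statistics import fmean


def first_divergence(pos_loss, neg_loss, tol=1e-3):
    """Index of the first token whose losses differ by more than `tol`.

    Assumes both lists are aligned token by token (an assumption, not a
    documented property). Returns None if no compared position diverges.
    """
    for i, (p, n) in enumerate(zip(pos_loss, neg_loss)):
        if abs(p - n) > tol:
            return i
    return None


def suffix_mean_gap(pos_loss, neg_loss, start):
    """Mean negative loss minus mean positive loss from `start` onward."""
    return fmean(neg_loss[start:]) - fmean(pos_loss[start:])


# Toy vectors standing in for `positive_sample_tokenwise_loss` and
# `negative_sample_tokenwise_loss` of a single row.
pos = [1.0, 2.0, 3.0, 3.0]
neg = [1.0, 2.0, 3.5, 4.5]
idx = first_divergence(pos, neg)
gap = suffix_mean_gap(pos, neg, idx)
print(idx, gap)
```

A positive gap after the divergence point means the model found the negative continuation less likely than the positive one from that step on.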
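Usage
The dataset can be loaded per config with the Hugging Face Datasets library, for example load_dataset('SpeechPPL/SALMon_pGSLM-normalized', 'bg_alignment', split='train') to get the training data for the background alignment task. The audio fields of each example can be played back directly or used for feature analysis, and the discrete units (raw_units) can be used to compute the acoustic fidelity of model outputs. The token-wise loss values (tokenwise_loss) give a quantitative basis for comparing positive and negative samples at each generation step, which suits preference alignment or model behavior diagnosis tasks. Researchers can also compare the model-generated continuation audio (model_generated_continuation) with the positive and negative samples, by listening or by automatic evaluation, to assess how well a speech generation model controls a specific attribute.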
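A minimal sketch of this workflow follows. It loads one config and scores each pair by checking whether the positive sample has the lower mean token-wise loss. Using the mean (rather than the sum) as the per-sample score is an assumption, and `pairwise_accuracy` and `load_loss_columns` are illustrative helper names, not part of the dataset or of the `datasets` library.

```python
from statistics import fmean

REPO_ID = "SpeechPPL/SALMon_pGSLM-normalized"
LOSS_COLUMNS = [
    "positive_sample_tokenwise_loss",
    "negative_sample_tokenwise_loss",
]


def pairwise_accuracy(rows):
    """Fraction of rows whose positive sample has the lower mean loss.

    Assumption: mean token-wise loss is used as the per-sample score.
    """
    rows = list(rows)
    wins = sum(
        fmean(r["positive_sample_tokenwise_loss"])
        < fmean(r["negative_sample_tokenwise_loss"])
        for r in rows
    )
    return wins / len(rows)


def load_loss_columns(config_name="bg_alignment"):
    """Load one config and keep only the loss columns (skips audio decoding)."""
    # Imported lazily so the scoring helper works without `datasets` installed.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, config_name, split="train")
    return ds.select_columns(LOSS_COLUMNS)


# Toy rows with the same field names, so the example runs offline.
toy_rows = [
    {"positive_sample_tokenwise_loss": [1.0, 1.0],
     "negative_sample_tokenwise_loss": [2.0, 2.0]},
    {"positive_sample_tokenwise_loss": [3.0],
     "negative_sample_tokenwise_loss": [1.0]},
]
toy_accuracy = pairwise_accuracy(toy_rows)
print(toy_accuracy)
```

With network access, `pairwise_accuracy(load_loss_columns("bg_alignment"))` would apply the same scoring to the real rows.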
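Background and challenges
Background overview
SALMon_pGSLM-normalized emerged during the rapid development of generative spoken language models. It aims to systematically evaluate and improve the controllability and consistency of models on zero-shot speech generation tasks. Built by the associated research team, it is organized around eight core tasks, including background alignment, domain consistency, sentiment alignment, and speaker consistency. Through carefully designed positive and negative sample pairs and rich low-level acoustic units such as pitch, duration, and HuBERT codes, it offers a standardized benchmark for studying how well models generalize over acoustic attributes such as timbre, prosody, emotion, and background noise. Its research focus is to expose the bottlenecks in a generative model's ability to transfer between, and stay consistent across, complex acoustic conditions, which bears directly on the robustness and interpretability of spoken language generation models.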
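Current challenges
The domain challenge the dataset addresses is that existing generative spoken language models, when imitating speech, often struggle to precisely control and maintain acoustic attributes such as the target background (for example, RIR), emotion, gender, or speaker identity while keeping the content accurate. This limits the perceptual consistency and practical value of the generated speech. During construction, the core challenge came from heterogeneous data schemas across model families: the output features of different models are mismatched in format and in semantic alignment. This calls for a careful normalization strategy that unifies the field structure while keeping the positive and negative sample pairs valid under each task definition and keeping logic such as loss fallback self-consistent across subtasks, so that the data is a reliable basis for model evaluation.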
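The card's schema declares the continuation loss fields as `null` in bg_alignment and sentiment_alignment and as float lists in the other configs. The sketch below shows one possible reading of "loss fallback": use the continuation loss when it is present and otherwise fall back to the full-sample loss. This direction of fallback is an assumption, not something the source documents, and `loss_with_fallback` is a hypothetical helper.

```python
def loss_with_fallback(row, polarity):
    """Return (loss_list, source) for 'positive' or 'negative'.

    Prefers `<polarity>_continuation_tokenwise_loss`; if that field is
    missing or None, falls back to `<polarity>_sample_tokenwise_loss`.
    The fallback direction is an assumption made for illustration.
    """
    if polarity not in ("positive", "negative"):
        raise ValueError(f"unknown polarity: {polarity!r}")
    loss = row.get(f"{polarity}_continuation_tokenwise_loss")
    if loss is None:
        return row[f"{polarity}_sample_tokenwise_loss"], "sample"
    return loss, "continuation"


# Toy rows mimicking the two schema variants in the card.
alignment_row = {
    "positive_sample_tokenwise_loss": [0.5, 0.7],
    "positive_continuation_tokenwise_loss": None,
}
consistency_row = {
    "positive_sample_tokenwise_loss": [0.5, 0.7],
    "positive_continuation_tokenwise_loss": [0.9],
}
print(loss_with_fallback(alignment_row, "positive"))
print(loss_with_fallback(consistency_row, "positive"))
```

Returning the source label alongside the values keeps downstream metrics from silently mixing the two kinds of loss.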
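Common scenarios
Classic use cases
In research on speech synthesis and generation models, SALMon_pGSLM-normalized supports evaluating and improving the fidelity and controllability of generative spoken language models. Through its config subsets, such as bg_alignment, speaker_consistency, and sentiment_alignment, it measures how well a model preserves attributes such as background noise, speaker characteristics, sentiment, and prosodic structure when generating speech. Researchers can use the positive and negative sample pairs, the token-wise loss values, and the raw acoustic units (such as HuBERT codes, pitch, and duration) to diagnose a model's bias on a specific attribute. A classic use case is comparing how different model architectures perform at maintaining background consistency or sentiment alignment, which lays the groundwork for finer-grained control in generative models.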
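Derived and related work
The release of SALMon_pGSLM-normalized has prompted a line of work on evaluating the fidelity of generative speech models. Building on its multi-dimensional consistency subsets, researchers have developed approaches such as attribute-aware contrastive learning frameworks aimed at improving generation stability along the speaker identity or emotion dimensions. Some work extends the evaluation paradigm further, combining background consistency tests with reverberation analysis to derive new metrics for environmental robustness. Other studies use it as a blueprint for cross-lingual or cross-modal attribute alignment benchmarks, advancing the fusion of acoustic and semantic information in multimodal generative models. Together, these works broaden the toolbox for controllable speech generation and continue to shape research directions in the field.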
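Recent research on the dataset
Latest research directions
Building benchmarks for fine-grained controllability and consistency evaluation of speech generation models. As generative speech models grow more capable in emotional expression, speaker identification, background sound effects, and acoustic environment simulation, the research focus has shifted to systematically quantifying a model's robustness and faithfulness when attributes are transferred across dimensions. Through carefully designed positive and negative sample pairs, the dataset covers core tasks including background alignment, domain consistency, gender preservation, reverberation robustness, sentiment alignment, and speaker coherence. It serves as a standardized test suite for evaluating how well large speech language models preserve high-level semantics and low-level acoustic features during continuous generation. Its combination of token-level loss analysis with joint annotation of discretized speech units (HuBERT, F0, duration) moves speech quality assessment away from simple MOS scoring and toward interpretable, fine-grained diagnosis, providing an evaluation foundation for more reliable and more controllable speech interaction systems.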
The content above was collected and summarized by 遇见数据集.



