SpeechPPL/SALMon_pGSLM-normalized

Name: SpeechPPL/SALMon_pGSLM-normalized
Creator: SpeechPPL
Published: 2026-04-10 14:11:28
License: No description available

Hugging Face · Updated 2026-04-10 · Indexed 2026-04-12

Download link:

https://hf-mirror.com/datasets/SpeechPPL/SALMon_pGSLM-normalized

Resource description:

---
configs:
- config_name: bg_alignment
  data_files:
  - split: train
    path: bg_alignment/train-*
- config_name: bg_all_consistency
  data_files:
  - split: train
    path: bg_all_consistency/train-*
- config_name: bg_domain_consistency
  data_files:
  - split: train
    path: bg_domain_consistency/train-*
- config_name: gender_consistency
  data_files:
  - split: train
    path: gender_consistency/train-*
- config_name: rir_consistency
  data_files:
  - split: train
    path: rir_consistency/train-*
- config_name: sentiment_alignment
  data_files:
  - split: train
    path: sentiment_alignment/train-*
- config_name: sentiment_consistency
  data_files:
  - split: train
    path: sentiment_consistency/train-*
- config_name: speaker_consistency
  data_files:
  - split: train
    path: speaker_consistency/train-*
dataset_info:
- config_name: bg_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    dtype: 'null'
  - name: negative_continuation_tokenwise_loss
    dtype: 'null'
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 87019899
    num_examples: 200
  download_size: 87019899
  dataset_size: 87019899
- config_name: bg_all_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 207143109
    num_examples: 200
  download_size: 207143109
  dataset_size: 207143109
- config_name: bg_domain_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 210303767
    num_examples: 200
  download_size: 210303767
  dataset_size: 210303767
- config_name: gender_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 209062383
    num_examples: 200
  download_size: 209062383
  dataset_size: 209062383
- config_name: rir_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 202129101
    num_examples: 200
  download_size: 202129101
  dataset_size: 202129101
- config_name: sentiment_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    dtype: 'null'
  - name: negative_continuation_tokenwise_loss
    dtype: 'null'
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 46742156
    num_examples: 200
  download_size: 46742156
  dataset_size: 46742156
- config_name: sentiment_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 210124373
    num_examples: 200
  download_size: 210124373
  dataset_size: 210124373
- config_name: speaker_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_sample_tokenwise_loss
    list: float32
  - name: negative_sample_tokenwise_loss
    list: float32
  - name: prompt_sample_tokenwise_loss
    list: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    list: float64
  - name: negative_continuation_tokenwise_loss
    list: float64
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 209864717
    num_examples: 200
  download_size: 209864717
  dataset_size: 209864717
---

# SALMon Normalized Dataset

This repo preserves the SALMon per-config folder layout while normalizing mismatched schema details across model families.
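As a quick sanity check on the metadata above, the per-config `num_bytes` and `num_examples` values can be totaled to estimate the overall download footprint. The sketch below only copies numbers from the front matter; the helper name `summarize` is illustrative and not part of the dataset or any library.

```python
# Per-config train-split sizes in bytes, copied from the dataset_info metadata above.
NUM_BYTES = {
    "bg_alignment": 87019899,
    "bg_all_consistency": 207143109,
    "bg_domain_consistency": 210303767,
    "gender_consistency": 209062383,
    "rir_consistency": 202129101,
    "sentiment_alignment": 46742156,
    "sentiment_consistency": 210124373,
    "speaker_consistency": 209864717,
}
EXAMPLES_PER_CONFIG = 200  # every config reports num_examples: 200


def summarize(num_bytes: dict) -> tuple:
    """Return (total bytes, total examples) across all configs."""
    return sum(num_bytes.values()), EXAMPLES_PER_CONFIG * len(num_bytes)


total_bytes, total_examples = summarize(NUM_BYTES)
print(f"{total_examples} examples, {total_bytes / 1e9:.2f} GB")
```

This comes to 1600 examples and roughly 1.38 GB. The two alignment configs are noticeably smaller than the consistency configs.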

Provided by:

SpeechPPL

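Aggregated Summary

Dataset Introduction

Construction

As speech generation models evolve quickly, evaluating their outputs systematically and along multiple dimensions has become a key challenge. SALMon_pGSLM-normalized was built to address this need: based on the SALMon framework, it post-processes and normalizes pGSLM outputs. The construction preserves the original per-config folder layout while aligning schema details that differ across model families. It contains eight configs (bg_alignment, bg_all_consistency, bg_domain_consistency, gender_consistency, rir_consistency, sentiment_alignment, sentiment_consistency, and speaker_consistency), each with 200 examples in its train split. Each example includes positive and negative audio, the model-generated continuation audio, HuBERT-based discrete units, and token-wise loss values, which gives structured material for analyzing model behavior along a specific dimension. This summary also mentions fundamental-frequency (F0) and duration features, but the schema in the resource description lists only a hubert field under each raw-units entry.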

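Characteristics

The dataset's most notable characteristics are its structured layout and its multi-dimensional evaluation coverage. First, its eight configs evaluate speech generation models on background alignment (bg_alignment), cross-domain and gender consistency, speaker consistency, sentiment alignment and consistency, and consistency under reverberation (rir_consistency). Second, each example carries detailed metadata: positive and negative audio with their discrete representations (this summary names HuBERT, F0, and duration, while the schema lists a hubert field), plus the model-generated continuation audio and the corresponding token-wise loss vectors. This level of detail lets researchers examine model behavior at individual time steps during generation. Finally, all audio is stored at a 16 kHz sampling rate, which keeps the data format standardized and supports reproducibility.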
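One way to look at behavior at individual time steps is to track how the loss gap between the negative and positive sample accumulates along the sequence. The sketch below is a minimal illustration on toy numbers. It assumes the two token-wise loss lists can be compared position by position over their shared prefix, which the source does not state; the function names are hypothetical.

```python
from itertools import accumulate


def cumulative_gap(positive_loss, negative_loss):
    """Running sum of (negative - positive) loss over the shared prefix.

    Assumption: step i of both sequences is comparable. Positions beyond
    the shorter sequence are ignored.
    """
    gaps = [neg - pos for pos, neg in zip(positive_loss, negative_loss)]
    return list(accumulate(gaps))


def first_preference_step(positive_loss, negative_loss):
    """First step at which the cumulative gap favors the positive sample, else None."""
    for step, gap in enumerate(cumulative_gap(positive_loss, negative_loss)):
        if gap > 0:
            return step
    return None


# Toy values, not taken from the dataset.
pos = [2.0, 1.0, 0.5]
neg = [2.0, 1.5, 2.5, 3.0]
print(cumulative_gap(pos, neg))         # [0.0, 0.5, 2.5]
print(first_preference_step(pos, neg))  # 1
```

In practice `pos` and `neg` would come from `positive_sample_tokenwise_loss` and `negative_sample_tokenwise_loss` of one row.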

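Usage

Load a specific subset by config name with the Hugging Face Datasets library, for example load_dataset('SpeechPPL/SALMon_pGSLM-normalized', 'bg_alignment', split='train') for the background-alignment data. The audio fields of each example can be played back or used for feature analysis directly, and the discrete units (raw_units) can be used to assess the acoustic fidelity of model outputs. The token-wise loss values (tokenwise_loss) give a quantitative basis for comparing positive and negative samples at each generation step, which suits preference-alignment or model-diagnosis tasks. The model-generated continuation audio (model_generated_continuation) can also be compared against the positive and negative samples, by listening or with automatic metrics, to assess how well a model controls a specific attribute during generation.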
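The loading call and the positive/negative comparison described above might be sketched as follows. Two things here are assumptions rather than facts from the source: that a lower mean token-wise loss on the positive sample counts as a correct preference, and that the mean (rather than the sum) is the right aggregate. The helper names are illustrative.

```python
from statistics import fmean

REPO_ID = "SpeechPPL/SALMon_pGSLM-normalized"
LOSS_COLUMNS = ["positive_sample_tokenwise_loss", "negative_sample_tokenwise_loss"]


def load_config(config_name: str):
    """Load one config's train split (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # lazy import so the helpers below work offline

    return load_dataset(REPO_ID, config_name, split="train")


def prefers_positive(row) -> bool:
    """True if the positive sample has the lower mean token-wise loss (assumed criterion)."""
    return fmean(row[LOSS_COLUMNS[0]]) < fmean(row[LOSS_COLUMNS[1]])


def accuracy(rows) -> float:
    """Fraction of rows in which the positive sample is preferred."""
    rows = list(rows)
    return sum(prefers_positive(row) for row in rows) / len(rows)


# Toy rows with the same column names as the dataset; values are made up.
toy_rows = [
    {"positive_sample_tokenwise_loss": [1.0, 2.0], "negative_sample_tokenwise_loss": [2.0, 3.0]},
    {"positive_sample_tokenwise_loss": [3.0, 3.0], "negative_sample_tokenwise_loss": [1.0, 1.0]},
]
print(accuracy(toy_rows))  # 0.5
```

On the real data, something like `accuracy(load_config("bg_alignment").select_columns(LOSS_COLUMNS))` would avoid decoding the audio columns while iterating.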

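Background and Challenges

Background Overview

SALMon_pGSLM-normalized appeared during a period of rapid progress in generative spoken language models. Built by its research team, it aims to evaluate, and help improve, controllability and consistency in zero-shot speech generation. The dataset covers eight tasks, including background alignment, domain consistency, sentiment alignment, and speaker consistency. Using paired positive and negative samples together with low-level units (this summary lists pitch, duration, and HuBERT codes), it offers a standardized benchmark for studying how well models generalize over acoustic attributes such as timbre, prosody, emotion, and background noise. Its focus is on exposing where generative models fail to carry over or maintain consistency across complex acoustic conditions, with the goal of supporting more robust and interpretable spoken generation models.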

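Current Challenges

The domain challenge this dataset addresses is that current generative spoken language models, when imitating speech, often struggle to control and maintain target acoustic attributes such as the background (for example, the room impulse response, RIR), emotion, gender, or speaker identity while keeping the content accurate. This limits the perceptual consistency and practical value of the generated speech. During construction, the main challenge was schema heterogeneity across model families: the output features of different models are mismatched in format and semantic alignment. A careful normalization strategy is therefore needed to unify the field structure, keep the positive/negative pairs valid under each task definition, and keep logic such as loss fallback consistent across subtasks, so that the data is a reliable basis for evaluation.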
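The source does not define "loss fallback". One plausible reading, given that the schema types the continuation loss columns as null in bg_alignment and sentiment_alignment, is to use the continuation loss when it is present and otherwise fall back to the full-sample loss. The sketch below illustrates that reading only; `pick_loss` is a hypothetical helper, not the authors' logic.

```python
def pick_loss(row, side):
    """Return (loss list, source label) for side 'positive' or 'negative'.

    Assumed rule: prefer `<side>_continuation_tokenwise_loss`; if it is
    None or empty (as the null-typed columns suggest for the alignment
    configs), fall back to `<side>_sample_tokenwise_loss`.
    """
    if side not in ("positive", "negative"):
        raise ValueError(f"unknown side: {side!r}")
    continuation = row.get(f"{side}_continuation_tokenwise_loss")
    if continuation:
        return list(continuation), "continuation"
    return list(row[f"{side}_sample_tokenwise_loss"]), "sample"


# Toy rows with made-up values: one consistency-style, one alignment-style.
consistency_row = {
    "positive_sample_tokenwise_loss": [1.0, 1.5, 2.0],
    "positive_continuation_tokenwise_loss": [1.5, 2.0],
}
alignment_row = {
    "positive_sample_tokenwise_loss": [0.5, 0.75],
    "positive_continuation_tokenwise_loss": None,
}
print(pick_loss(consistency_row, "positive"))  # ([1.5, 2.0], 'continuation')
print(pick_loss(alignment_row, "positive"))    # ([0.5, 0.75], 'sample')
```

Returning the source label alongside the values makes it easy to check that scores from different configs are not silently mixed.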

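Common Scenarios

Typical Use Cases

In research on speech synthesis and generative models, SALMon_pGSLM-normalized supports evaluating and improving the fidelity and controllability of generative spoken language models. Through configs such as bg_alignment, speaker_consistency, and sentiment_alignment, it measures how well a model preserves attributes such as background noise, speaker characteristics, sentiment, and prosodic structure when generating speech. Researchers can use the paired positive and negative samples, the token-wise loss values, and the raw acoustic units (this summary lists HuBERT codes, pitch, and duration) to diagnose a model's bias on a specific attribute. A typical use is comparing how different model architectures maintain background consistency or sentiment alignment, as a basis for improving fine-grained control in generation.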

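Derived and Related Work

According to this summary, which cites no specific publications, the dataset has prompted follow-up work on fidelity evaluation for generative speech models. It describes attribute-aware contrastive learning frameworks built on the consistency configs to stabilize generation along speaker-identity or sentiment dimensions. It also describes extensions of the evaluation protocol that combine background-consistency tests with reverberation analysis to derive new metrics for environmental robustness. Other work is said to use the dataset as a template for cross-lingual or cross-modal attribute-alignment benchmarks, exploring how acoustic and semantic information can be combined in multimodal generative models. Together, these directions are presented as extending the toolkit for controllable speech generation.

Recent Research on the Dataset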