SpeechPPL/SALMon_Flow-SLM-1B-normalized
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_Flow-SLM-1B-normalized
Resource description:
---
configs:
- config_name: bg_alignment
data_files:
- split: train
path: bg_alignment/train-*
- config_name: bg_all_consistency
data_files:
- split: train
path: bg_all_consistency/train-*
- config_name: bg_domain_consistency
data_files:
- split: train
path: bg_domain_consistency/train-*
- config_name: gender_consistency
data_files:
- split: train
path: gender_consistency/train-*
- config_name: rir_consistency
data_files:
- split: train
path: rir_consistency/train-*
- config_name: sentiment_alignment
data_files:
- split: train
path: sentiment_alignment/train-*
- config_name: sentiment_consistency
data_files:
- split: train
path: sentiment_consistency/train-*
- config_name: speaker_consistency
data_files:
- split: train
path: speaker_consistency/train-*
dataset_info:
- config_name: bg_alignment
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_sample_tokenwise_loss
sequence: float32
- name: negative_sample_tokenwise_loss
sequence: float32
- name: code_frame_rate
dtype: int64
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation1
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation2
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation3
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation4
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation5
dtype:
audio:
sampling_rate: 16000
splits:
- name: train
num_bytes: 86636466
num_examples: 200
download_size: 86636466
dataset_size: 86636466
- config_name: bg_all_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_sample_tokenwise_loss
sequence: float32
- name: negative_sample_tokenwise_loss
sequence: float32
- name: prompt_sample_tokenwise_loss
sequence: float32
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 16000
- name: code_frame_rate
dtype: int64
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: positive_continuation_tokenwise_loss
sequence: float64
- name: negative_continuation_tokenwise_loss
sequence: float64
- name: model_generated_continuation1
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation2
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation3
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation4
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation5
dtype:
audio:
sampling_rate: 16000
splits:
- name: train
num_bytes: 1314444664
num_examples: 200
download_size: 1314444664
dataset_size: 1314444664
- config_name: bg_domain_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_sample_tokenwise_loss
sequence: float32
- name: negative_sample_tokenwise_loss
sequence: float32
- name: prompt_sample_tokenwise_loss
sequence: float32
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 16000
- name: code_frame_rate
dtype: int64
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: positive_continuation_tokenwise_loss
sequence: float64
- name: negative_continuation_tokenwise_loss
sequence: float64
- name: model_generated_continuation1
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation2
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation3
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation4
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation5
dtype:
audio:
sampling_rate: 16000
splits:
- name: train
num_bytes: 1317622398
num_examples: 200
download_size: 1317622398
dataset_size: 1317622398
- config_name: gender_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_sample_tokenwise_loss
sequence: float32
- name: negative_sample_tokenwise_loss
sequence: float32
- name: prompt_sample_tokenwise_loss
sequence: float32
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 16000
- name: code_frame_rate
dtype: int64
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: positive_continuation_tokenwise_loss
sequence: float64
- name: negative_continuation_tokenwise_loss
sequence: float64
- name: model_generated_continuation1
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation2
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation3
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation4
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation5
dtype:
audio:
sampling_rate: 16000
splits:
- name: train
num_bytes: 1315570881
num_examples: 200
download_size: 1315570881
dataset_size: 1315570881
- config_name: rir_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_sample_tokenwise_loss
sequence: float32
- name: negative_sample_tokenwise_loss
sequence: float32
- name: prompt_sample_tokenwise_loss
sequence: float32
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 16000
- name: code_frame_rate
dtype: int64
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: positive_continuation_tokenwise_loss
sequence: float64
- name: negative_continuation_tokenwise_loss
sequence: float64
- name: model_generated_continuation1
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation2
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation3
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation4
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation5
dtype:
audio:
sampling_rate: 16000
splits:
- name: train
num_bytes: 1303542748
num_examples: 200
download_size: 1303542748
dataset_size: 1303542748
- config_name: sentiment_alignment
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_sample_tokenwise_loss
sequence: float32
- name: negative_sample_tokenwise_loss
sequence: float32
- name: code_frame_rate
dtype: int64
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation1
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation2
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation3
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation4
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation5
dtype:
audio:
sampling_rate: 16000
splits:
- name: train
num_bytes: 46471666
num_examples: 200
download_size: 46471666
dataset_size: 46471666
- config_name: sentiment_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_sample_tokenwise_loss
sequence: float32
- name: negative_sample_tokenwise_loss
sequence: float32
- name: prompt_sample_tokenwise_loss
sequence: float32
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 16000
- name: code_frame_rate
dtype: int64
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: positive_continuation_tokenwise_loss
sequence: float64
- name: negative_continuation_tokenwise_loss
sequence: float64
- name: model_generated_continuation1
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation2
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation3
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation4
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation5
dtype:
audio:
sampling_rate: 16000
splits:
- name: train
num_bytes: 1313113461
num_examples: 200
download_size: 1313113461
dataset_size: 1313113461
- config_name: speaker_consistency
features:
- name: task
dtype: string
- name: ind
dtype: int64
- name: positive_audio
dtype: audio
- name: negative_audio
dtype: audio
- name: audio_transition_s
dtype: int64
- name: prompt_audio
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_positive
dtype:
audio:
sampling_rate: 16000
- name: continuation_audio_negative
dtype:
audio:
sampling_rate: 16000
- name: negative_audio_sanity
dtype:
audio:
sampling_rate: 16000
- name: positive_sample_tokenwise_loss
sequence: float32
- name: negative_sample_tokenwise_loss
sequence: float32
- name: prompt_sample_tokenwise_loss
sequence: float32
- name: model_generated_continuation
dtype:
audio:
sampling_rate: 16000
- name: code_frame_rate
dtype: int64
- name: code_depth
dtype: int64
- name: model_sampling_rate
dtype: int64
- name: ppl_sanity
dtype: int64
- name: positive_continuation_tokenwise_loss
sequence: float64
- name: negative_continuation_tokenwise_loss
sequence: float64
- name: model_generated_continuation1
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation2
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation3
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation4
dtype:
audio:
sampling_rate: 16000
- name: model_generated_continuation5
dtype:
audio:
sampling_rate: 16000
splits:
- name: train
num_bytes: 1316618426
num_examples: 200
download_size: 1316618426
dataset_size: 1316618426
---
# SALMon Normalized Dataset
This repo preserves the SALMon per-config folder layout while normalizing
mismatched schema details across model families.
Provider:
SpeechPPL
Collected Summary
Dataset Introduction

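Construction
The SALMon_Flow-SLM-1B-normalized dataset is derived from SALMon and addresses schema mismatches across model families. It keeps SALMon's per-config folder layout and normalizes the audio features and loss formats of each subset. Each of the eight configs (bg_alignment, gender_consistency, and so on) contains 200 examples in a single train split. The core of each example is audio, with most audio fields declared at a 16 kHz sampling rate, accompanied by token-wise loss sequences for the positive and negative samples, several model-generated continuations, and auxiliary metadata. The configs also carry task-specific fields: in the consistency configs, audio_transition_s records the audio transition time and prompt_sample_tokenwise_loss holds the loss over the prompt audio. Together these form a structured evaluation set.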
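The schema difference between the two kinds of config can be written down directly from the dataset_info metadata. The following is a minimal sketch; the helper name `extra_fields` and the constant names are hypothetical and not part of the dataset or of any library.

```python
# Config names and field names are transcribed from the dataset_info metadata.
ALIGNMENT_CONFIGS = {"bg_alignment", "sentiment_alignment"}
CONSISTENCY_CONFIGS = {
    "bg_all_consistency",
    "bg_domain_consistency",
    "gender_consistency",
    "rir_consistency",
    "sentiment_consistency",
    "speaker_consistency",
}

# Fields that the metadata lists only under the *_consistency configs.
CONSISTENCY_ONLY_FIELDS = (
    "audio_transition_s",
    "prompt_sample_tokenwise_loss",
    "positive_continuation_tokenwise_loss",
    "negative_continuation_tokenwise_loss",
)


def extra_fields(config_name: str) -> tuple:
    """Return the fields a config has beyond the shared schema (hypothetical helper)."""
    if config_name in CONSISTENCY_CONFIGS:
        return CONSISTENCY_ONLY_FIELDS
    if config_name in ALIGNMENT_CONFIGS:
        return ()
    raise KeyError(f"unknown config: {config_name}")
```

Code that iterates over all eight configs can use such a lookup to avoid reading a column that an alignment config does not have.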
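Features
The dataset's main strength is multi-dimensional, fine-grained evaluation. With positive/negative audio pairs, prompt audio, and continuations, it provides material for evaluating how well speech language models handle alignment and consistency over attributes such as background noise, speaker identity, sentiment, gender, and room impulse response (RIR). Each example stores six model-generated continuations (model_generated_continuation and model_generated_continuation1 through model_generated_continuation5) alongside token-wise losses for the positive and negative samples, so researchers can analyze generation behavior both locally and globally. Every config except bg_alignment and sentiment_alignment exceeds 1.3 GB; those two are roughly 87 MB and 46 MB. The ppl_sanity field can be used to check sample validity. Overall, the design aims to combine fine-grained control with broad evaluation coverage.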
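The size figures can be checked with simple arithmetic on the num_bytes values in the metadata. This is a small sketch that assumes decimal gigabytes (1 GB = 10^9 bytes); the variable names are illustrative.

```python
# num_bytes per config, copied from the dataset_info metadata.
NUM_BYTES = {
    "bg_alignment": 86636466,
    "bg_all_consistency": 1314444664,
    "bg_domain_consistency": 1317622398,
    "gender_consistency": 1315570881,
    "rir_consistency": 1303542748,
    "sentiment_alignment": 46471666,
    "sentiment_consistency": 1313113461,
    "speaker_consistency": 1316618426,
}
EXAMPLES_PER_CONFIG = 200  # num_examples is 200 for every config

total_bytes = sum(NUM_BYTES.values())
total_examples = EXAMPLES_PER_CONFIG * len(NUM_BYTES)

# Configs above the 1.3 GB mark (decimal GB is an assumption of this sketch).
large_configs = sorted(name for name, size in NUM_BYTES.items() if size > 1.3e9)

for name, size in NUM_BYTES.items():
    print(f"{name:24s} {size / 1e9:6.2f} GB")
print(f"total: {total_bytes / 1e9:.2f} GB over {total_examples} examples")
```

The two alignment configs fall well below the threshold, which is consistent with their smaller schema.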
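Usage
The dataset loads through the Hugging Face Datasets library, and users select a config as needed, for example gender_consistency for gender-consistency evaluation. Note the schema differences between configs: consistency tasks (such as bg_all_consistency) include fields such as audio_transition_s and prompt_sample_tokenwise_loss, while alignment tasks (such as bg_alignment) do not. Audio fields such as prompt_audio, continuation_audio_positive, and model_generated_continuation are stored at a 16 kHz sampling rate and can be fed to a model directly. The token-wise loss sequences (for example positive_sample_tokenwise_loss) can be used to compute contrastive losses or for preference optimization, and the multiple model-generated continuations support beam search or diversity analysis, which makes the dataset a useful benchmark for training and evaluating controllable audio generation models.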
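Loading one config might look like the sketch below. It assumes the `datasets` package is installed and that the repository is reachable under the ID shown in the page title; `load_config` and `continuation_columns` are hypothetical helper names, and decoding the audio columns may need an additional audio backend.

```python
REPO_ID = "SpeechPPL/SALMon_Flow-SLM-1B-normalized"


def load_config(config_name: str):
    """Load the train split of one config (requires `datasets` and network access)."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train")


def continuation_columns() -> list:
    """Names of the six model-generated continuation columns listed in the metadata."""
    return ["model_generated_continuation"] + [
        f"model_generated_continuation{i}" for i in range(1, 6)
    ]
```

A call such as `load_config("gender_consistency")` would return a dataset whose rows expose the fields listed in the metadata, for example `row["positive_sample_tokenwise_loss"]`.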
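Background and Challenges
Background Overview
As speech generation models evolve quickly, how to systematically evaluate and improve a model's controllability and consistency over fine-grained acoustic attributes has become a central question in the field. SALMon_Flow-SLM-1B-normalized is a resource built for this problem. Published by SpeechPPL, it covers eight configs: background alignment, background consistency (all and domain), gender consistency, reverberation (RIR) consistency, sentiment alignment, sentiment consistency, and speaker consistency. By providing positive/negative sample pairs and model-generated continuations, it gives a structured view of how well speech language models preserve or change each attribute. The dataset is intended to support a shift from evaluating audio quality alone to evaluating multi-attribute controllability, giving researchers a finer-grained view of model behavior.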
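Current Challenges
The domain challenge the dataset addresses is that speech generation models often do well when generating a single audio clip but struggle, over continued generation, to hold steady or precisely change a specific acoustic attribute such as speaker identity or emotional tone, which limits their usefulness and trustworthiness in interactive speech applications. Building the dataset raised several difficulties: designing a standardized evaluation protocol across eight task configs that removes schema inconsistencies between model families; ensuring, across the positive and negative samples, that attribute changes occur at the right moment and transition naturally; and using fine-grained measures such as token-wise loss to quantify model behavior in depth. Keeping only 200 selected examples per config balances evaluation coverage against annotation cost, and it places high demands on how representative and robust those examples are.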
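One way to turn the token-wise losses into a single score is a pairwise comparison: count how often the positive sample gets the lower loss. The sketch below assumes that the loss is a negative-log-likelihood-style quantity (lower is better) and that averaging over tokens is an acceptable length normalization. It is not confirmed to be the scoring rule used by SALMon or by the dataset authors, and the function names are hypothetical.

```python
from statistics import fmean


def prefers_positive(pos_loss, neg_loss) -> bool:
    """True when the positive sample has the lower mean token-wise loss."""
    return fmean(pos_loss) < fmean(neg_loss)


def pairwise_accuracy(rows) -> float:
    """Fraction of rows in which the positive sample is preferred."""
    rows = list(rows)
    if not rows:
        raise ValueError("need at least one row")
    hits = sum(
        prefers_positive(
            row["positive_sample_tokenwise_loss"],
            row["negative_sample_tokenwise_loss"],
        )
        for row in rows
    )
    return hits / len(rows)


# Toy rows with made-up losses, only to show the expected input shape.
toy_rows = [
    {"positive_sample_tokenwise_loss": [1.0, 2.0],
     "negative_sample_tokenwise_loss": [2.0, 3.0]},
    {"positive_sample_tokenwise_loss": [3.0],
     "negative_sample_tokenwise_loss": [1.0]},
]
print(pairwise_accuracy(toy_rows))
```

Using the mean rather than the sum keeps samples of different lengths comparable; whether that matches the intended protocol is an open assumption.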
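Common Scenarios
Typical Use Cases
In speech generation and synthesis, the main use of SALMon_Flow-SLM-1B-normalized is to evaluate and improve how speech language models such as Flow-SLM, the model named in the dataset title and described in this summary as flow-matching based, handle fine-grained attribute control. The dataset has sub-task configs for background consistency, gender consistency, reverberation consistency, sentiment alignment and consistency, and speaker consistency, and each config provides positive/negative sample pairs plus detailed metadata such as token-wise losses. Researchers typically use these configs to test systematically whether a model keeps acoustic or semantic dimensions such as background sound, sentiment, gender, and speaker identity coherent, and to use the recorded losses to guide the model toward more accurate audio continuations without adding inference cost. Overall, the dataset offers a standardized evaluation benchmark for conditional audio generation and controllable speech synthesis.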
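A token-wise loss sequence can also be summarized as a perplexity-style number. The sketch below assumes the stored losses are per-token negative log-likelihoods in natural-log units, which the dataset card does not state; the function name is hypothetical.

```python
import math


def perplexity(tokenwise_loss) -> float:
    """exp(mean loss), assuming natural-log negative log-likelihoods per token."""
    losses = list(tokenwise_loss)
    if not losses:
        raise ValueError("empty loss sequence")
    return math.exp(sum(losses) / len(losses))


# A sequence whose every token has loss ln(4) has perplexity 4 under this definition.
print(perplexity([math.log(4.0)] * 3))
```

If the losses were computed in another base or summed over several codebooks per frame (see code_depth), the conversion would need to be adjusted accordingly.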
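Academic Problems Addressed
For research, the dataset addresses the lack of a unified quantitative evaluation of how speech language models maintain attribute consistency and support fine-grained controllable generation. By providing standardized cross-task data with positive/negative contrast samples, it lets researchers measure how well a model preserves key attributes: background noise (bg_alignment, bg_all_consistency), sentiment (sentiment_alignment, sentiment_consistency), gender (gender_consistency), and speaker identity (speaker_consistency). According to this summary, the design helps flow-matching models achieve steadier attribute control in continued audio generation with less random drift, encourages exploration of training approaches such as "loss-guided generation", and contributes to the understanding of speech consistency, fidelity, and generalization, making it a tool for evaluating the robustness of generative speech models.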
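Derived and Related Work
According to this summary, the dataset has led to several lines of follow-up work. One is a "consistency loss" optimization framework developed on the bg_all_consistency config and later extended to multimodal speech generation tasks. Another uses the sentiment_alignment data to train sentiment-aware speech flow-matching models, leading to sentiment-controllable text-to-speech systems. In addition, the gender_consistency and speaker_consistency configs provide standard contrast data for cross-gender and cross-speaker voice style transfer, and have motivated consistency-strengthening methods based on adversarial training. These works are said to have established new evaluation conventions in speech research and to have fed back into the training of models such as Flow-SLM, marking a shift in speech generation research from "can generate" to "can control".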
The content above was collected and summarized by 遇见数据集.



