SpeechPPL/SALMon_TWIST-1.3B-normalized
Source: Hugging Face, updated 2026-04-10, indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_TWIST-1.3B-normalized
Resource description:
---
configs:
- config_name: bg_alignment
  data_files:
  - split: train
    path: bg_alignment/train-*
- config_name: bg_all_consistency
  data_files:
  - split: train
    path: bg_all_consistency/train-*
- config_name: bg_domain_consistency
  data_files:
  - split: train
    path: bg_domain_consistency/train-*
- config_name: gender_consistency
  data_files:
  - split: train
    path: gender_consistency/train-*
- config_name: rir_consistency
  data_files:
  - split: train
    path: rir_consistency/train-*
- config_name: sentiment_alignment
  data_files:
  - split: train
    path: sentiment_alignment/train-*
- config_name: sentiment_consistency
  data_files:
  - split: train
    path: sentiment_consistency/train-*
- config_name: speaker_consistency
  data_files:
  - split: train
    path: speaker_consistency/train-*
dataset_info:
- config_name: bg_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: offset
    sequence: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 86798378
    num_examples: 200
  download_size: 86798378
  dataset_size: 86798378
- config_name: bg_all_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_raw_units
    sequence: int32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: offset
    sequence: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 233273821
    num_examples: 200
  download_size: 233273821
  dataset_size: 233273821
- config_name: bg_domain_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_raw_units
    sequence: int32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: offset
    sequence: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 235801949
    num_examples: 200
  download_size: 235801949
  dataset_size: 235801949
- config_name: gender_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_raw_units
    sequence: int32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: offset
    sequence: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 234300317
    num_examples: 200
  download_size: 234300317
  dataset_size: 234300317
- config_name: rir_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_raw_units
    sequence: int32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: offset
    sequence: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 217527612
    num_examples: 200
  download_size: 217527612
  dataset_size: 217527612
- config_name: sentiment_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: offset
    sequence: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 46603631
    num_examples: 200
  download_size: 46603631
  dataset_size: 46603631
- config_name: sentiment_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_raw_units
    sequence: int32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: offset
    sequence: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 231210614
    num_examples: 200
  download_size: 231210614
  dataset_size: 231210614
- config_name: speaker_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: positive_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_raw_units
    sequence: int32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: positive_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_raw_units
    dtype:
    - name: hubert
      dtype: string
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: offset
    sequence: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 235262052
    num_examples: 200
  download_size: 235262052
  dataset_size: 235262052
---
# SALMon Normalized Dataset
This repo preserves the SALMon per-config folder layout while normalizing
mismatched schema details across model families.
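
The per-config folder layout in the metadata above (one `<config>/train-*` glob per config) could be used roughly as follows. This is a minimal sketch, not an official loader: `train_glob` and `load_config` are hypothetical helper names, and `load_config` assumes the `datasets` library is installed and the Hub is reachable.

```python
REPO_ID = "SpeechPPL/SALMon_TWIST-1.3B-normalized"

# Config names as listed in the card metadata.
CONFIG_NAMES = (
    "bg_alignment",
    "bg_all_consistency",
    "bg_domain_consistency",
    "gender_consistency",
    "rir_consistency",
    "sentiment_alignment",
    "sentiment_consistency",
    "speaker_consistency",
)


def train_glob(config_name):
    """Return the data-file glob the card declares for a config's train split."""
    if config_name not in CONFIG_NAMES:
        raise ValueError(f"unknown config: {config_name!r}")
    return f"{config_name}/train-*"


def load_config(config_name):
    """Load one config's train split (hypothetical helper; needs network access)."""
    from datasets import load_dataset  # imported lazily so the module works offline

    train_glob(config_name)  # validates the config name before hitting the network
    return load_dataset(REPO_ID, config_name, split="train")


print(train_glob("rir_consistency"))  # rir_consistency/train-*
```

Calling `load_config("speaker_consistency")` would then return the 200-example train split for that config, assuming the repository is accessible.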
Provider:
SpeechPPL
Collected summary
Dataset introduction

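Construction
SALMon_TWIST-1.3B-normalized supports research on evaluating speech generation models and pairs the SALMon benchmark with outputs from the TWIST-1.3B model. It is organized as a multi-config Hugging Face dataset with eight configs (bg_alignment, bg_all_consistency, bg_domain_consistency, gender_consistency, rir_consistency, sentiment_alignment, sentiment_consistency, speaker_consistency), each with a 200-example train split. For each acoustic scenario (for example, whether the background noise stays consistent or whether the speaker's gender stays the same), an example pairs a positive and a negative audio clip with prompt audio, continuation audio, token-wise losses, and HuBERT unit sequences. The prompt, continuation, sanity-check, and model-generated audio fields are declared at a 16 kHz sampling rate, and each record also carries structured metadata: offset, code frame rate, code depth, and model sampling rate.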
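Features
The dataset's defining feature is its fine-grained, multi-dimensional evaluation design. Rather than reporting a single benchmark, it covers background noise, background domain, gender, room impulse response (RIR), sentiment, and speaker identity, split into two task types: alignment (background and sentiment) and consistency (all background, background domain, gender, RIR, sentiment, and speaker). Each example contains a positive/negative pair and a continuation generated by the model, together with HuBERT discrete units and token-wise losses, so that a model's handling of each attribute can be quantified and compared. Normalizing schema differences across model families also makes the data easier to reuse across models.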
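
The two task types also differ in schema: in the card metadata, the `*_consistency` configs list seven columns that the `*_alignment` configs do not. The sketch below encodes that difference so a loaded split can be checked against it. The helper names `split_config_name` and `expected_columns` are hypothetical, and the column order here does not match the card's order.

```python
# Columns the card metadata lists for every config.
BASE_COLUMNS = [
    "task", "ind", "positive_audio", "negative_audio",
    "prompt_audio", "continuation_audio_positive",
    "continuation_audio_negative", "negative_audio_sanity",
    "positive_sample_tokenwise_loss", "positive_sample_raw_units",
    "negative_sample_tokenwise_loss", "negative_sample_raw_units",
    "code_frame_rate", "code_depth", "offset",
    "model_sampling_rate", "ppl_sanity", "model_generated_continuation",
]

# Columns the card metadata lists only for *_consistency configs.
CONSISTENCY_ONLY_COLUMNS = [
    "audio_transition_s",
    "prompt_sample_tokenwise_loss", "prompt_sample_raw_units",
    "positive_continuation_tokenwise_loss", "positive_continuation_raw_units",
    "negative_continuation_tokenwise_loss", "negative_continuation_raw_units",
]


def split_config_name(config_name):
    """Split e.g. 'bg_all_consistency' into ('bg_all', 'consistency')."""
    attribute, _, task_type = config_name.rpartition("_")
    if not attribute or task_type not in ("alignment", "consistency"):
        raise ValueError(f"unexpected config name: {config_name!r}")
    return attribute, task_type


def expected_columns(config_name):
    """Return the column names the card declares for a config (order not preserved)."""
    _, task_type = split_config_name(config_name)
    columns = list(BASE_COLUMNS)
    if task_type == "consistency":
        columns += CONSISTENCY_ONLY_COLUMNS
    return columns


print(split_config_name("bg_all_consistency"))   # ('bg_all', 'consistency')
print(len(expected_columns("bg_alignment")))      # 18
print(len(expected_columns("rir_consistency")))   # 25
```

Comparing `set(expected_columns(name))` with the column names of a loaded split is one way to confirm that the normalized schema matches the card.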
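Usage
To use SALMon_TWIST-1.3B-normalized, choose the config that matches the evaluation goal, for example gender_consistency to check whether a speech model preserves the speaker's gender. Loading a config's train split with the Hugging Face Datasets library gives access to the positive and negative audio samples, the prompt audio, the HuBERT unit sequences, and the other fields. Typical uses include comparing the token-wise losses of the positive and negative samples to see which one the model finds more likely, and contrasting positive_audio with negative_audio to quantify how well the model tracks attribute consistency. A config name (such as 'speaker_consistency') must be specified when loading. The prompt and continuation audio is declared at 16 kHz and can be used directly in downstream analysis.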
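
The loss comparison described above can be sketched as a pairwise accuracy: the fraction of examples where the positive sample gets a lower loss than the negative one. This is an illustrative sketch rather than the dataset authors' scoring code. It assumes that a lower mean token-wise loss means the model finds the sample more likely, and that averaging (not summing) over tokens is the intended length normalization. `pairwise_accuracy` is a hypothetical name.

```python
from statistics import fmean


def pairwise_accuracy(rows):
    """Fraction of rows whose positive sample has the lower mean token-wise loss.

    `rows` is any iterable of dict-like records exposing the two loss columns
    from the card metadata. Ties count as failures.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("no rows to score")
    wins = 0
    for row in rows:
        positive = fmean(row["positive_sample_tokenwise_loss"])
        negative = fmean(row["negative_sample_tokenwise_loss"])
        if positive < negative:
            wins += 1
    return wins / len(rows)


# Toy records with made-up losses, only to show the expected input shape.
toy_rows = [
    {"positive_sample_tokenwise_loss": [1.0, 2.0],
     "negative_sample_tokenwise_loss": [2.0, 3.0]},
    {"positive_sample_tokenwise_loss": [3.0],
     "negative_sample_tokenwise_loss": [1.0]},
]
print(pairwise_accuracy(toy_rows))  # 0.5
```

A split loaded with the Datasets library can be passed in directly, since iterating over it yields one dict per example. Selecting only the two loss columns first avoids decoding the audio fields.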
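Background and challenges
Background overview
SALMon_TWIST-1.3B-normalized was created by a research team working on speech and music analysis (the listed provider is SpeechPPL) to give generative speech models a fine-grained evaluation benchmark. It was released recently. Its core research question is how to quantify whether attributes such as background noise, speaker identity, sentiment, gender, and reverberation are preserved and kept consistent in generated speech. By comparing samples through token-wise losses and raw units, the dataset provides 1,600 examples (eight configs of 200 each) covering eight subtasks, including background alignment, sentiment consistency, and speaker consistency. Its value lies in offering a standardized way to evaluate large speech models such as TWIST-1.3B in complex acoustic conditions, supporting quantitative research on controllable speech generation.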
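
The example count and overall size follow directly from the per-config `num_examples` and `num_bytes` values in the card metadata. The short calculation below only re-adds those published numbers.

```python
# (num_examples, num_bytes) per config, copied from the card metadata.
SPLIT_SIZES = {
    "bg_alignment": (200, 86798378),
    "bg_all_consistency": (200, 233273821),
    "bg_domain_consistency": (200, 235801949),
    "gender_consistency": (200, 234300317),
    "rir_consistency": (200, 217527612),
    "sentiment_alignment": (200, 46603631),
    "sentiment_consistency": (200, 231210614),
    "speaker_consistency": (200, 235262052),
}

total_examples = sum(examples for examples, _ in SPLIT_SIZES.values())
total_bytes = sum(size for _, size in SPLIT_SIZES.values())

print(total_examples)                 # 1600
print(f"{total_bytes / 1e9:.2f} GB")  # 1.52 GB
```

The two alignment configs are noticeably smaller than the consistency configs, which matches their shorter column lists.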
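Current challenges
The problem this dataset addresses is the difficulty of evaluating generative speech models under multiple attribute constraints: a model has to preserve the spoken content while also maintaining or controlling non-semantic characteristics such as background, speaker, and sentiment, and existing metrics struggle to capture consistency at that level of detail. The construction challenges mostly concern normalization. First, sampling rate, frame rate, code depth, and other architecture-specific properties have to be aligned across model families so that comparisons are fair. Second, a unified format for structured fields such as token-wise losses and raw units resolves the schema mismatches that model differences cause in the original SALMon data. Finally, positive/negative pairs covering the eight attribute tasks are hand-designed and accompanied by sanity checks, balancing evaluation coverage against data quality.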
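
Because frame rates differ between model families, token positions are only comparable once they are mapped to time. The sketch below shows one way to do that with the `code_frame_rate` and `audio_transition_s` columns. It rests on assumptions the card does not confirm: that `code_frame_rate` is tokens per second, that `audio_transition_s` is the transition point in seconds, that each frame has one token (code depth 1), and that `offset` can be ignored. Both function names are hypothetical.

```python
def frame_index_at(seconds, code_frame_rate):
    """Map a time in seconds to a token index, assuming one token per frame."""
    if code_frame_rate <= 0:
        raise ValueError("code_frame_rate must be positive")
    return int(round(seconds * code_frame_rate))


def split_at_transition(tokenwise_loss, audio_transition_s, code_frame_rate):
    """Split a per-token loss sequence into (before, after) the transition point."""
    cut = frame_index_at(audio_transition_s, code_frame_rate)
    cut = max(0, min(cut, len(tokenwise_loss)))  # clamp to the sequence length
    return tokenwise_loss[:cut], tokenwise_loss[cut:]


# Toy values chosen for readability, not taken from the dataset.
before, after = split_at_transition(list(range(10)), audio_transition_s=2, code_frame_rate=3)
print(before)  # [0, 1, 2, 3, 4, 5]
print(after)   # [6, 7, 8, 9]
```

The consistency configs already ship separate prompt and continuation loss columns, so a split like this is mainly useful as a cross-check on how those columns line up with the full-sample losses.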
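Common scenarios
Typical use cases
In speech generation and audio understanding, SALMon_TWIST-1.3B-normalized serves as a fine-grained benchmark for evaluating and improving large neural audio models. Its typical use is measuring how well a model preserves acoustic and semantic attributes in a speech continuation task, across background noise, speaker identity, sentiment prosody, room impulse response, and domain consistency. Using the positive/negative pairs and their per-pair loss data, researchers can systematically test whether a model, given a speech prompt, favors a plausible continuation with consistent attributes, which helps push speech generation models toward finer-grained control.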
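Derived and related work
The most representative work building on this dataset includes research on improving speech representations with contrastive learning and proposals for multi-task attribute-consistency frameworks. Using its positive/negative pairs, researchers have explored attribute disentanglement through token-wise losses in self-supervised speech feature spaces such as HuBERT. The dataset has also prompted benchmarking efforts that aim to unify the evaluation of consistency in speech generation, and attempts to fold attribute-alignment losses into generative model training. This work adds to research on controllability in speech generation and lays methodological groundwork for studying attribute preservation across languages and cultures.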
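Recent research on the dataset
Latest research directions
SALMon_TWIST-1.3B-normalized currently focuses on evaluating the multi-dimensional controllability and consistency of speech generation models, and serves as a validation resource for neural audio synthesis. Its finely divided configs cover background alignment, sentiment alignment, and consistency along several axes (domain, gender, reverberation, sentiment, speaker), providing a standardized benchmark for evaluating and improving how well speech language models preserve specific acoustic attributes such as sentiment prosody and speaker identity. The central challenge it targets is ensuring that a model continuing a speech prompt keeps both the contextual logic and the acoustic coherence. The structured features it provides (positive/negative pairs, token-wise losses, and HuBERT units) support building robust, controllable audio generation systems, which matters for the naturalness and trustworthiness of human-computer interaction and virtual assistants.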
The content above was collected and summarized by 遇见数据集.



