SpeechPPL/SALMon_TWIST-1.3B-normalized

Name: SpeechPPL/SALMon_TWIST-1.3B-normalized
Creator: SpeechPPL
Published: 2026-04-10 14:35:46
License: not specified

Source: Hugging Face. Updated 2026-04-10. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/SpeechPPL/SALMon_TWIST-1.3B-normalized

Resource description:

--- configs: - config_name: bg_alignment data_files: - split: train path: bg_alignment/train-* - config_name: bg_all_consistency data_files: - split: train path: bg_all_consistency/train-* - config_name: bg_domain_consistency data_files: - split: train path: bg_domain_consistency/train-* - config_name: gender_consistency data_files: - split: train path: gender_consistency/train-* - config_name: rir_consistency data_files: - split: train path: rir_consistency/train-* - config_name: sentiment_alignment data_files: - split: train path: sentiment_alignment/train-* - config_name: sentiment_consistency data_files: - split: train path: sentiment_consistency/train-* - config_name: speaker_consistency data_files: - split: train path: speaker_consistency/train-* dataset_info: - config_name: bg_alignment features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: positive_sample_raw_units dtype: - name: hubert dtype: string - name: negative_sample_tokenwise_loss sequence: float32 - name: negative_sample_raw_units dtype: - name: hubert dtype: string - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: offset sequence: int64 - name: model_sampling_rate sequence: int64 - name: ppl_sanity dtype: int64 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 splits: - name: train num_bytes: 86798378 num_examples: 200 download_size: 86798378 dataset_size: 86798378 - config_name: bg_all_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: 
audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: positive_sample_raw_units dtype: - name: hubert dtype: string - name: negative_sample_tokenwise_loss sequence: float32 - name: negative_sample_raw_units dtype: - name: hubert dtype: string - name: prompt_sample_tokenwise_loss sequence: float32 - name: prompt_sample_raw_units sequence: int32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: positive_continuation_raw_units dtype: - name: hubert dtype: string - name: negative_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_raw_units dtype: - name: hubert dtype: string - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: offset sequence: int64 - name: model_sampling_rate sequence: int64 - name: ppl_sanity dtype: int64 splits: - name: train num_bytes: 233273821 num_examples: 200 download_size: 233273821 dataset_size: 233273821 - config_name: bg_domain_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: positive_sample_raw_units dtype: - name: hubert dtype: string - name: negative_sample_tokenwise_loss sequence: float32 - name: negative_sample_raw_units dtype: - name: 
hubert dtype: string - name: prompt_sample_tokenwise_loss sequence: float32 - name: prompt_sample_raw_units sequence: int32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: positive_continuation_raw_units dtype: - name: hubert dtype: string - name: negative_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_raw_units dtype: - name: hubert dtype: string - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: offset sequence: int64 - name: model_sampling_rate sequence: int64 - name: ppl_sanity dtype: int64 splits: - name: train num_bytes: 235801949 num_examples: 200 download_size: 235801949 dataset_size: 235801949 - config_name: gender_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: positive_sample_raw_units dtype: - name: hubert dtype: string - name: negative_sample_tokenwise_loss sequence: float32 - name: negative_sample_raw_units dtype: - name: hubert dtype: string - name: prompt_sample_tokenwise_loss sequence: float32 - name: prompt_sample_raw_units sequence: int32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: positive_continuation_raw_units dtype: - name: hubert dtype: string - name: negative_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_raw_units dtype: - name: hubert dtype: string - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth 
dtype: int64 - name: offset sequence: int64 - name: model_sampling_rate sequence: int64 - name: ppl_sanity dtype: int64 splits: - name: train num_bytes: 234300317 num_examples: 200 download_size: 234300317 dataset_size: 234300317 - config_name: rir_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: positive_sample_raw_units dtype: - name: hubert dtype: string - name: negative_sample_tokenwise_loss sequence: float32 - name: negative_sample_raw_units dtype: - name: hubert dtype: string - name: prompt_sample_tokenwise_loss sequence: float32 - name: prompt_sample_raw_units sequence: int32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: positive_continuation_raw_units dtype: - name: hubert dtype: string - name: negative_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_raw_units dtype: - name: hubert dtype: string - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: offset sequence: int64 - name: model_sampling_rate sequence: int64 - name: ppl_sanity dtype: int64 splits: - name: train num_bytes: 217527612 num_examples: 200 download_size: 217527612 dataset_size: 217527612 - config_name: sentiment_alignment features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: 
continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: positive_sample_raw_units dtype: - name: hubert dtype: string - name: negative_sample_tokenwise_loss sequence: float32 - name: negative_sample_raw_units dtype: - name: hubert dtype: string - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: offset sequence: int64 - name: model_sampling_rate sequence: int64 - name: ppl_sanity dtype: int64 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 splits: - name: train num_bytes: 46603631 num_examples: 200 download_size: 46603631 dataset_size: 46603631 - config_name: sentiment_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: positive_sample_raw_units dtype: - name: hubert dtype: string - name: negative_sample_tokenwise_loss sequence: float32 - name: negative_sample_raw_units dtype: - name: hubert dtype: string - name: prompt_sample_tokenwise_loss sequence: float32 - name: prompt_sample_raw_units sequence: int32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: positive_continuation_raw_units dtype: - name: hubert dtype: string - name: negative_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_raw_units dtype: - name: hubert dtype: string - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: 
offset sequence: int64 - name: model_sampling_rate sequence: int64 - name: ppl_sanity dtype: int64 splits: - name: train num_bytes: 231210614 num_examples: 200 download_size: 231210614 dataset_size: 231210614 - config_name: speaker_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: positive_sample_raw_units dtype: - name: hubert dtype: string - name: negative_sample_tokenwise_loss sequence: float32 - name: negative_sample_raw_units dtype: - name: hubert dtype: string - name: prompt_sample_tokenwise_loss sequence: float32 - name: prompt_sample_raw_units sequence: int32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: positive_continuation_raw_units dtype: - name: hubert dtype: string - name: negative_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_raw_units dtype: - name: hubert dtype: string - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: offset sequence: int64 - name: model_sampling_rate sequence: int64 - name: ppl_sanity dtype: int64 splits: - name: train num_bytes: 235262052 num_examples: 200 download_size: 235262052 dataset_size: 235262052
---

# SALMon Normalized Dataset

This repo preserves the SALMon per-config folder layout while normalizing mismatched schema details across model families.

Provider:

SpeechPPL

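Aggregated summary

Dataset introduction

Construction

The SALMon_TWIST-1.3B-normalized dataset supports research on evaluating and aligning speech generation models and is built around the TWIST-1.3B model. It is organized as a multi-config Hugging Face dataset with eight configs (bg_alignment, bg_all_consistency, bg_domain_consistency, gender_consistency, rir_consistency, sentiment_alignment, sentiment_consistency, speaker_consistency), each holding 200 examples in a single train split. For each acoustic scenario (for example, whether the background noise stays consistent or whether the speaker's gender stays coherent), every example pairs a positive and a negative audio clip with a prompt audio, positive and negative continuation audio, per-token losses (tokenwise loss), and HuBERT unit representations. The prompt, continuation, sanity-check, and model-generated audio fields are declared at a 16 kHz sampling rate, while positive_audio and negative_audio carry no declared rate in the schema. Each example also records structured metadata such as the offset, code frame rate, code depth, and model sampling rate.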

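Features

The dataset's core strength is its fine-grained, multi-dimensional evaluation design. Rather than producing a single benchmark score, it covers background noise, domain, gender, room impulse response (RIR), sentiment, and speaker, with two task types: alignment (bg_alignment, sentiment_alignment) and consistency (the remaining six configs). This allows generated audio to be analyzed for both semantic fidelity and timbre coherence. Every example contains a positive/negative pair plus a continuation generated by the original model, together with HuBERT-based discrete units and per-token losses, so audio quality and attribute control can be quantified and compared. Normalizing schema differences across model families also makes the data easier to reuse across models.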
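As a rough illustration of how a per-token loss sequence can be turned into a single comparable number, the sketch below converts a loss list into a perplexity. It assumes the values in fields such as `positive_sample_tokenwise_loss` are natural-log negative log-likelihoods, which the listing does not confirm; the function name `sequence_perplexity` is hypothetical.

```python
import math


def sequence_perplexity(tokenwise_loss):
    """Perplexity of one sample: exp of the mean per-token loss.

    Assumption: each loss is a natural-log negative log-likelihood.
    """
    if not tokenwise_loss:
        raise ValueError("empty loss sequence")
    return math.exp(sum(tokenwise_loss) / len(tokenwise_loss))
```

Under that assumption, a lower perplexity for the positive sample than for the negative one means the model finds the positive sample more likely.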

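Usage

To use SALMon_TWIST-1.3B-normalized, pick the config that matches the evaluation goal, for example gender_consistency to check how well gender is preserved in generated speech. Loading the train split of a config with the Hugging Face Datasets library gives access to the positive and negative audio, the prompt audio, the HuBERT unit sequences, and the other fields. Typical uses include comparing the per-token losses of the positive and negative samples to measure the model's perplexity gap, or running contrastive experiments on positive_audio and negative_audio to quantify attribute consistency. A config name (such as 'speaker_consistency') must be given when loading. The prompt, continuation, and model-generated audio fields are declared at 16 kHz and can be used directly for downstream training and analysis.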
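The loading and comparison steps described here could look roughly like the following. This is a minimal sketch, not the authors' evaluation code: the repo ID and field names come from the dataset card above, while the helper names `load_config` and `pairwise_accuracy` and the choice of comparing mean (rather than summed) losses are assumptions.

```python
from statistics import mean

REPO_ID = "SpeechPPL/SALMon_TWIST-1.3B-normalized"


def load_config(config_name):
    """Load the train split of one config (needs `datasets` and network access)."""
    # Imported lazily so pairwise_accuracy can be used without the library.
    from datasets import load_dataset

    return load_dataset(REPO_ID, config_name, split="train")


def pairwise_accuracy(rows):
    """Fraction of rows whose positive sample has the lower mean per-token loss.

    Assumption: a lower mean loss means the model prefers that sample.
    """
    wins = 0
    total = 0
    for row in rows:
        pos = row["positive_sample_tokenwise_loss"]
        neg = row["negative_sample_tokenwise_loss"]
        if not pos or not neg:
            continue  # skip rows with an empty loss sequence
        total += 1
        if mean(pos) < mean(neg):
            wins += 1
    return wins / total if total else float("nan")
```

A call such as `pairwise_accuracy(load_config("speaker_consistency"))` would score one config. Selecting only the two loss columns first (for instance with `Dataset.select_columns`) avoids decoding the audio fields while iterating.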

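Background and challenges

Background overview

The SALMon_TWIST-1.3B-normalized dataset was published on Hugging Face by SpeechPPL on 2026-04-10 to give generative speech models a fine-grained evaluation benchmark. Its central question is how well different attributes (background noise, speaker identity, sentiment, gender, reverberation, and so on) are preserved and kept consistent in speech generation. By comparing per-token losses and raw units across paired samples and model-generated audio, it provides 1,600 examples (eight configs of 200 each) covering sub-tasks such as background alignment, sentiment consistency, and speaker consistency. It offers a standardized way to evaluate large speech models such as TWIST-1.3B in complex acoustic conditions and supports quantitative research on controllable speech generation.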

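Current challenges

The domain challenge this dataset addresses is evaluating generative speech models under multiple attribute constraints: while preserving the spoken content, a model must also maintain or control non-semantic features such as background, speaker, and sentiment, and existing metrics struggle to capture consistency at that level of detail. The construction challenges lie mostly in normalization. First, heterogeneous properties such as audio sampling rate, code frame rate, and code depth have to be aligned across model families so that comparisons are fair. Second, structured fields such as the per-token losses and raw units are put into a uniform format, resolving the schema mismatches that model differences caused in the original SALMon data. Third, positive/negative pairs are designed for all eight configs and sanity-check fields (negative_audio_sanity, ppl_sanity) are included, balancing evaluation coverage against data quality.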

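Common scenarios

Classic use cases

In speech generation and audio understanding, SALMon_TWIST-1.3B-normalized serves as a fine-grained benchmark for evaluating and improving large neural audio models. Its classic use case is measuring how well a model preserves acoustic and semantic attributes in a speech continuation task, across background noise, speaker identity, sentiment and prosody, room impulse response, and domain consistency. With the designed positive/negative pairs and the recorded loss data, researchers can systematically test whether a model, given a speech prompt, prefers a plausible continuation whose attributes stay consistent, which in turn pushes speech generation models toward finer and more controllable output.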
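When scores are collected for several attributes, it can help to group them by task type. The sketch below lists the eight config names from the dataset card and averages per-config scores within the alignment and consistency groups; the helper names `split_config_name` and `macro_average` are hypothetical, and the unweighted mean is one possible aggregation, not a method the dataset prescribes.

```python
# Config names as listed in the dataset card's YAML front matter.
CONFIGS = [
    "bg_alignment",
    "bg_all_consistency",
    "bg_domain_consistency",
    "gender_consistency",
    "rir_consistency",
    "sentiment_alignment",
    "sentiment_consistency",
    "speaker_consistency",
]


def split_config_name(name):
    """Split e.g. 'bg_domain_consistency' into ('bg_domain', 'consistency')."""
    attribute, _, kind = name.rpartition("_")
    if not attribute or kind not in ("alignment", "consistency"):
        raise ValueError(f"unexpected config name: {name!r}")
    return attribute, kind


def macro_average(scores_by_config):
    """Unweighted mean of per-config scores, grouped by task type."""
    grouped = {}
    for name, score in scores_by_config.items():
        _, kind = split_config_name(name)
        grouped.setdefault(kind, []).append(score)
    return {kind: sum(vals) / len(vals) for kind, vals in grouped.items()}
```

Since every config has the same number of examples (200), the unweighted mean within a group equals the example-weighted mean.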

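Derived and related work

Among work derived from this dataset, the most representative directions are contrastive-learning approaches to improving speech representations and multi-task frameworks for attribute consistency. Using the positive/negative pairs, researchers have explored attribute disentanglement through token-wise losses in self-supervised speech feature spaces such as HuBERT. The dataset has also motivated benchmark efforts that aim to unify the evaluation of consistency in speech generation, along with attempts to fold attribute alignment losses into the training of generative models. This work broadens research on controllable speech generation and lays methodological groundwork for studying attribute preservation across languages and cultures.

Recent research on the dataset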