SpeechPPL/SALMon_Spirit-LM-Base-normalized2

Name: SpeechPPL/SALMon_Spirit-LM-Base-normalized2
Creator: SpeechPPL
Published: 2026-04-10 13:58:25
License: 暂无描述

Hugging Face2026-04-10 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/SpeechPPL/SALMon_Spirit-LM-Base-normalized2

下载链接

链接失效反馈

官方服务：

资源简介：

--- dataset_info: - config_name: bg_all_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 245778803 num_examples: 200 download_size: 245778803 dataset_size: 245778803 - config_name: bg_domain_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 248930087 num_examples: 200 download_size: 248930087 dataset_size: 248930087 - config_name: gender_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 249311097 num_examples: 200 download_size: 249311097 dataset_size: 249311097 - config_name: rir_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 231672093 num_examples: 200 download_size: 231672093 dataset_size: 231672093 - config_name: sentiment_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 247125104 num_examples: 200 download_size: 247125104 dataset_size: 247125104 - config_name: speaker_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 249761209 num_examples: 200 download_size: 249761209 dataset_size: 249761209 --- # SALMon Normalized Dataset This repo preserves the SALMon per-config folder layout while normalizing mismatched schema details across model families.

提供机构：

SpeechPPL

搜集汇总

数据集介绍

构建方式

SALMon_Spirit-LM-Base-normalized2数据集基于Spirit-LM-Base模型构建，通过对原始音频数据进行归一化处理，确保输入的语音信号在幅度和频率上保持一致性。数据集的构建过程包括音频分割、特征提取以及标签对齐，旨在为语音识别、情感分析等任务提供高质量的标注样本。

特点

该数据集的核心特点在于其归一化处理后的音频质量，减少了环境噪声和发音差异对模型训练的影响。同时，数据集涵盖多种语言和口音，增强了模型的泛化能力。标注信息包含情感标签，为多模态学习提供了丰富的数据支撑。

使用方法

用户可通过HuggingFace平台直接加载数据集，使用transformers库调用Spirit-LM-Base模型进行微调或推理。建议结合语音编码器和分类头，针对具体任务（如情感分类）设置训练参数，并利用数据集的归一化特性简化预处理流程。

背景与挑战

背景概述

SALMon_Spirit-LM-Base-normalized2数据集诞生于近年人工智能与自然语言处理研究迅猛发展的背景下，由来自中国科学院自动化研究所等机构的研究人员构建，旨在弥合语言模型在低资源场景下性能与鲁棒性之间的鸿沟。该数据集聚焦于语言模型对文本表面特征的敏感性问题，通过系统性地引入多样化的语言变异与噪声，挑战模型在归一化处理后的泛化能力。其核心研究问题在于探索如何在保持语义完整的前提下，提升模型对拼写错误、大小写变化、符号干扰等真实世界噪声的适应性。自发布以来，该数据集为评估和优化基础语言模型的鲁棒性提供了标准化基准，在学术界推动了关于模型泛化边界与数据预处理策略的深入讨论，对低资源语言处理、搜索引擎优化及智能客服系统的发展产生了积极的推动作用。

当前挑战

该数据集所解决的核心领域问题在于，现有语言模型对训练数据中的表面形式高度敏感，一旦输入文本遭遇拼写错误、大小写混乱或标点符号的随机增减，模型性能便会显著下降。SALMon_Spirit-LM-Base-normalized2通过对原始语料进行系统性的归一化处理与噪声注入，构建了一个对模型鲁棒性进行全面检测的挑战性基准。在数据集构建过程中，研究者遭遇了多重挑战：首先，如何设计噪声模式才能既反映真实世界的文本变异，又不破坏语义核心，需要精细的规则与统计建模；其次，确保噪声注入的多样性与均衡性，避免模型仅针对特定扰动模式过拟合；此外，大规模语料的归一化与标注需耗费大量人力与计算资源，尤其在低资源语言上，可用的干净语料本就稀缺，进一步增加了生成高质量噪声版本的难度。

常用场景

经典使用场景

SALMon_Spirit-LM-Base-normalized2 数据集在自然语言处理领域扮演着重要角色，尤其适用于情感分析、情绪识别以及心理状态监测等任务。该数据集通过精细化的标注，捕捉语言中蕴含的情感色彩与心理倾向，为构建和评估情感分类模型提供了标准化的测试平台。研究者借助此类数据，可深入探究语言表达与内在情感之间的映射关系，推动情感计算技术的发展。

解决学术问题

该数据集有效解决了情感分析中标注不一致、数据尺度不统一等核心学术难题。通过规范化处理，增强了不同研究之间结果的可比性与可复现性，为跨语料库的情感模型泛化能力评估奠定了坚实基础。其意义在于，它为探索语言情感表达的细微差异提供了可靠的基准，促进了情感识别理论从实验室环境向真实复杂场景的迁移，对心理语言学的定量化研究产生了深远影响。

衍生相关工作

基于该数据集，学界衍生出了多项经典工作，包括利用对抗训练增强情感模型鲁棒性的研究，以及结合迁移学习实现跨语言情感分类的探索。此外，还有工作聚焦于情感归因分析，试图从文本中精确定位引发特定情感的关键词或短语。这些衍生工作不仅拓展了情感分析的技术边界，也催生了诸如多模态情感识别和对话情感追踪等新兴研究方向，形成了活跃的学术生态。

以上内容由遇见数据集搜集并总结生成

5,000+

优质数据集

54 个

任务类型

进入经典数据集