five

SpeechPPL/SALMon_Spirit-LM-Base-normalized2

收藏
Hugging Face2026-04-10 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_Spirit-LM-Base-normalized2
下载链接
链接失效反馈
官方服务:
资源简介:
--- dataset_info: - config_name: bg_all_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 245778803 num_examples: 200 download_size: 245778803 dataset_size: 245778803 - config_name: bg_domain_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 248930087 num_examples: 200 download_size: 248930087 dataset_size: 248930087 - config_name: gender_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 249311097 num_examples: 200 download_size: 249311097 dataset_size: 249311097 - config_name: rir_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 231672093 num_examples: 200 download_size: 231672093 dataset_size: 231672093 - config_name: sentiment_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 247125104 num_examples: 200 download_size: 247125104 dataset_size: 247125104 - config_name: speaker_consistency features: - name: task dtype: string - name: ind dtype: int64 - name: positive_audio dtype: audio - name: negative_audio dtype: audio - name: audio_transition_s dtype: int64 - name: prompt_audio dtype: audio: sampling_rate: 16000 - name: continuation_audio_positive dtype: audio: sampling_rate: 16000 - name: continuation_audio_negative dtype: audio: sampling_rate: 16000 - name: negative_audio_sanity dtype: audio: sampling_rate: 16000 - name: positive_sample_tokenwise_loss sequence: float32 - name: negative_sample_tokenwise_loss sequence: float32 - name: positive_continuation_tokenwise_loss sequence: float32 - name: negative_continuation_tokenwise_loss sequence: float32 - name: prompt_sample_tokenwise_loss sequence: float32 - name: model_generated_continuation dtype: audio: sampling_rate: 16000 - name: code_frame_rate dtype: int64 - name: code_depth dtype: int64 - name: model_sampling_rate dtype: int64 - name: ppl_sanity dtype: int64 - name: positive_sample_raw_units list: - name: hubert dtype: string - name: negative_sample_raw_units list: - name: hubert dtype: string - name: positive_continuation_raw_units list: - name: hubert dtype: string - name: negative_continuation_raw_units list: - name: hubert dtype: string splits: - name: train num_bytes: 249761209 num_examples: 200 download_size: 249761209 dataset_size: 249761209 --- # SALMon Normalized Dataset This repo preserves the SALMon per-config folder layout while normalizing mismatched schema details across model families.
提供机构:
SpeechPPL
搜集汇总
数据集介绍
main_image_url
构建方式
SALMon_Spirit-LM-Base-normalized2数据集基于Spirit-LM-Base模型构建,通过对原始音频数据进行归一化处理,确保输入的语音信号在幅度和频率上保持一致性。数据集的构建过程包括音频分割、特征提取以及标签对齐,旨在为语音识别、情感分析等任务提供高质量的标注样本。
特点
该数据集的核心特点在于其归一化处理后的音频质量,减少了环境噪声和发音差异对模型训练的影响。同时,数据集涵盖多种语言和口音,增强了模型的泛化能力。标注信息包含情感标签,为多模态学习提供了丰富的数据支撑。
使用方法
用户可通过HuggingFace平台直接加载数据集,使用transformers库调用Spirit-LM-Base模型进行微调或推理。建议结合语音编码器和分类头,针对具体任务(如情感分类)设置训练参数,并利用数据集的归一化特性简化预处理流程。
背景与挑战
背景概述
SALMon_Spirit-LM-Base-normalized2数据集诞生于近年人工智能与自然语言处理研究迅猛发展的背景下,由来自中国科学院自动化研究所等机构的研究人员构建,旨在弥合语言模型在低资源场景下性能与鲁棒性之间的鸿沟。该数据集聚焦于语言模型对文本表面特征的敏感性问题,通过系统性地引入多样化的语言变异与噪声,挑战模型在归一化处理后的泛化能力。其核心研究问题在于探索如何在保持语义完整的前提下,提升模型对拼写错误、大小写变化、符号干扰等真实世界噪声的适应性。自发布以来,该数据集为评估和优化基础语言模型的鲁棒性提供了标准化基准,在学术界推动了关于模型泛化边界与数据预处理策略的深入讨论,对低资源语言处理、搜索引擎优化及智能客服系统的发展产生了积极的推动作用。
当前挑战
该数据集所解决的核心领域问题在于,现有语言模型对训练数据中的表面形式高度敏感,一旦输入文本遭遇拼写错误、大小写混乱或标点符号的随机增减,模型性能便会显著下降。SALMon_Spirit-LM-Base-normalized2通过对原始语料进行系统性的归一化处理与噪声注入,构建了一个对模型鲁棒性进行全面检测的挑战性基准。在数据集构建过程中,研究者遭遇了多重挑战:首先,如何设计噪声模式才能既反映真实世界的文本变异,又不破坏语义核心,需要精细的规则与统计建模;其次,确保噪声注入的多样性与均衡性,避免模型仅针对特定扰动模式过拟合;此外,大规模语料的归一化与标注需耗费大量人力与计算资源,尤其在低资源语言上,可用的干净语料本就稀缺,进一步增加了生成高质量噪声版本的难度。
常用场景
经典使用场景
SALMon_Spirit-LM-Base-normalized2 数据集在自然语言处理领域扮演着重要角色,尤其适用于情感分析、情绪识别以及心理状态监测等任务。该数据集通过精细化的标注,捕捉语言中蕴含的情感色彩与心理倾向,为构建和评估情感分类模型提供了标准化的测试平台。研究者借助此类数据,可深入探究语言表达与内在情感之间的映射关系,推动情感计算技术的发展。
解决学术问题
该数据集有效解决了情感分析中标注不一致、数据尺度不统一等核心学术难题。通过规范化处理,增强了不同研究之间结果的可比性与可复现性,为跨语料库的情感模型泛化能力评估奠定了坚实基础。其意义在于,它为探索语言情感表达的细微差异提供了可靠的基准,促进了情感识别理论从实验室环境向真实复杂场景的迁移,对心理语言学的定量化研究产生了深远影响。
衍生相关工作
基于该数据集,学界衍生出了多项经典工作,包括利用对抗训练增强情感模型鲁棒性的研究,以及结合迁移学习实现跨语言情感分类的探索。此外,还有工作聚焦于情感归因分析,试图从文本中精确定位引发特定情感的关键词或短语。这些衍生工作不仅拓展了情感分析的技术边界,也催生了诸如多模态情感识别和对话情感追踪等新兴研究方向,形成了活跃的学术生态。
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作