SpeechPPL/SALMon_Flow-SLM-1B-normalized2
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_Flow-SLM-1B-normalized2
Resource description:
---
dataset_info:
- config_name: bg_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 86636466
    num_examples: 200
  download_size: 86636466
  dataset_size: 86636466
- config_name: bg_all_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1314444664
    num_examples: 200
  download_size: 1314444664
  dataset_size: 1314444664
- config_name: bg_domain_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1317622398
    num_examples: 200
  download_size: 1317622398
  dataset_size: 1317622398
- config_name: gender_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1315570881
    num_examples: 200
  download_size: 1315570881
  dataset_size: 1315570881
- config_name: rir_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1303542748
    num_examples: 200
  download_size: 1303542748
  dataset_size: 1303542748
- config_name: sentiment_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 46471666
    num_examples: 200
  download_size: 46471666
  dataset_size: 46471666
- config_name: sentiment_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1313113461
    num_examples: 200
  download_size: 1313113461
  dataset_size: 1313113461
- config_name: speaker_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1316618426
    num_examples: 200
  download_size: 1316618426
  dataset_size: 1316618426
---
# SALMon Normalized Dataset
This repo preserves the SALMon per-config folder layout while normalizing
mismatched schema details across model families.
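
Because the layout is per-config, one config is loaded at a time. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the repo id from the download link above is reachable; `load_config` is a hypothetical helper, not part of the repo. The config names are copied from the `dataset_info` front matter.

```python
from typing import Any

REPO_ID = "SpeechPPL/SALMon_Flow-SLM-1B-normalized2"

# Config names copied from the `dataset_info` front matter of this card.
CONFIGS = (
    "bg_alignment",
    "bg_all_consistency",
    "bg_domain_consistency",
    "gender_consistency",
    "rir_consistency",
    "sentiment_alignment",
    "sentiment_consistency",
    "speaker_consistency",
)


def load_config(name: str, streaming: bool = True) -> Any:
    """Load the `train` split of one config (hypothetical helper)."""
    if name not in CONFIGS:
        raise ValueError(f"unknown config {name!r}; expected one of {CONFIGS}")
    # Imported lazily so the config list can be used without `datasets` installed.
    from datasets import load_dataset

    # streaming=True avoids downloading a full config before the first row is read.
    return load_dataset(REPO_ID, name, split="train", streaming=streaming)
```

Calling `load_config("sentiment_alignment")` would return an iterable over that config's 200 rows. Streaming is the default in this sketch only because six of the eight configs are listed at roughly 1.3 GB each.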
Provider:
SpeechPPL
Summary
Dataset introduction

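Construction
According to the dataset card, this repo preserves the per-config folder layout of SALMon and normalizes schema details that were mismatched across model families. "Normalized" in the repo name therefore refers to the schema, not to numeric scaling of the data. The repo holds eight configs (`bg_alignment`, `bg_all_consistency`, `bg_domain_consistency`, `gender_consistency`, `rir_consistency`, `sentiment_alignment`, `sentiment_consistency`, `speaker_consistency`), each with a single `train` split of 200 examples. Column names such as `positive_sample_tokenwise_loss` and `model_generated_continuation` suggest that every example pairs benchmark audio with per-token losses and sampled continuations from one model, presumably Flow-SLM-1B as the repo name indicates; the card does not say so explicitly.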
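Features
This is a speech dataset. Every config stores a `task` string, an `ind` index, a `positive_audio` / `negative_audio` pair, prompt and continuation audio at 16 kHz, float32 per-token loss sequences for the positive and negative samples, three int64 codec fields (`code_frame_rate`, `code_depth`, `model_sampling_rate`), an int64 `ppl_sanity` field, and six model-generated continuations (`model_generated_continuation` plus `model_generated_continuation1` through `model_generated_continuation5`). The six `*_consistency` configs additionally carry `audio_transition_s`, `prompt_sample_tokenwise_loss`, and float64 `positive_continuation_tokenwise_loss` / `negative_continuation_tokenwise_loss`; the two `*_alignment` configs omit these columns. The alignment configs are small (about 46 MB and 87 MB), while each consistency config is about 1.3 GB.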
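
The `*_consistency` configs in the front matter carry an int64 `audio_transition_s` column that the card does not define. The sketch below assumes it is the time, in seconds, at which the recording switches from the prompt portion to the continuation portion; under that assumption a waveform can be cut at the transition. `split_at_transition` is a hypothetical helper that works on any plain sequence of samples.

```python
from typing import Sequence, Tuple


def split_at_transition(
    samples: Sequence[float], sampling_rate: int, transition_s: float
) -> Tuple[Sequence[float], Sequence[float]]:
    """Split a mono waveform into (before, after) at `transition_s` seconds.

    Assumption: `transition_s` plays the role of the `audio_transition_s`
    column; the dataset card does not document its meaning.
    """
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    cut = int(round(transition_s * sampling_rate))
    cut = max(0, min(cut, len(samples)))  # clamp to the waveform bounds
    return samples[:cut], samples[cut:]
```

With a 3-second clip at 16 kHz and a transition at 2 s, the first part holds 32,000 samples and the second 16,000.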
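Usage
Load one config at a time with the `load_dataset` function of the Hugging Face Datasets library, passing the repo id and the config name; each config exposes only a `train` split. Audio columns decode to waveforms, and the loss columns are plain float sequences. With 200 examples per config, the data is suited to evaluation rather than training: each row is a positive/negative pair, so a natural check is whether the positive sample receives a lower loss than the negative one. Because each consistency config is about 1.3 GB, streaming may be preferable to a full download.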
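
The `positive_sample_tokenwise_loss` and `negative_sample_tokenwise_loss` columns in the front matter lend themselves to a pairwise score. The sketch below counts a row as correct when the mean loss of the positive sample is strictly lower than that of the negative sample. The choice of the mean as the reduction, and the strict comparison, are assumptions; the card does not state how these columns are meant to be aggregated. `pairwise_accuracy` is a hypothetical helper.

```python
from statistics import fmean
from typing import Iterable, Mapping, Sequence


def pairwise_accuracy(rows: Iterable[Mapping[str, Sequence[float]]]) -> float:
    """Fraction of rows whose positive sample has lower mean token loss.

    Assumption: mean per-token loss is the comparison statistic; ties
    count as incorrect.
    """
    correct = 0
    total = 0
    for row in rows:
        pos = fmean(row["positive_sample_tokenwise_loss"])
        neg = fmean(row["negative_sample_tokenwise_loss"])
        correct += pos < neg
        total += 1
    if total == 0:
        raise ValueError("no rows to score")
    return correct / total
```

The function accepts any iterable of dict-like rows, so it can be fed synthetic data for a quick check or the rows of a loaded config.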
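Background and challenges
Background overview
SALMon is an evaluation benchmark for speech language models that tests sensitivity to acoustic properties of a recording rather than to its spoken content alone. Its tasks fall into two groups, which match the config names in this repo: consistency tasks (background, gender, room impulse response, sentiment, speaker), where a property either stays fixed or changes partway through a recording, and alignment tasks (background, sentiment), where the acoustics either fit or do not fit what is said. This repo, published by SpeechPPL, packages the benchmark audio together with one model's per-token losses and generated continuations in a normalized schema, apparently so that results from different model families can be compared column by column.
Current challenges
The card names one challenge directly: schema details were mismatched across model families and had to be normalized. The schema still shows some variation. Loss sequences are float32 in some columns and float64 in others, and the alignment and consistency configs expose different column sets, so code that handles several configs should check which columns are present rather than assume a single layout.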
Common scenarios
Typical use cases
The columns support three kinds of use: pairwise likelihood evaluation, by comparing the positive and negative token-wise losses per example; error analysis, by inspecting where along the sequence the losses of the two samples diverge; and qualitative review, by listening to the six model-generated continuations stored with each prompt.
Derived and related work
The dataset card does not list derived work. The repo name points to two upstream sources: the SALMon benchmark, whose per-config layout this repo preserves, and a model named Flow-SLM-1B, whose outputs appear to populate the loss and continuation columns.
Recent research on the dataset
Latest research directions
The dataset is published under an organization named SpeechPPL, its columns centre on token-wise loss, and it includes a `ppl_sanity` field. Together these point to perplexity-based evaluation of speech language models as the intended research setting. The card gives no further detail on ongoing research.
The content above was collected and summarized by 遇见数据集.



