SpeechPPL/SALMon_Flow-SLM-1B-Extended-normalized2
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_Flow-SLM-1B-Extended-normalized2
Resource description:
---
dataset_info:
- config_name: bg_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 86639819
    num_examples: 200
  download_size: 86639819
  dataset_size: 86639819
- config_name: bg_all_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1269408501
    num_examples: 200
  download_size: 1269408501
  dataset_size: 1269408501
- config_name: bg_domain_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1270504334
    num_examples: 200
  download_size: 1270504334
  dataset_size: 1270504334
- config_name: gender_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1315571898
    num_examples: 200
  download_size: 1315571898
  dataset_size: 1315571898
- config_name: rir_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1300026144
    num_examples: 200
  download_size: 1300026144
  dataset_size: 1300026144
- config_name: sentiment_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 46471520
    num_examples: 200
  download_size: 46471520
  dataset_size: 46471520
- config_name: sentiment_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1311269667
    num_examples: 200
  download_size: 1311269667
  dataset_size: 1311269667
- config_name: speaker_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 16000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  - name: positive_continuation_tokenwise_loss
    sequence: float64
  - name: negative_continuation_tokenwise_loss
    sequence: float64
  - name: model_generated_continuation1
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation2
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation3
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation4
    dtype:
      audio:
        sampling_rate: 16000
  - name: model_generated_continuation5
    dtype:
      audio:
        sampling_rate: 16000
  splits:
  - name: train
    num_bytes: 1316303506
    num_examples: 200
  download_size: 1316303506
  dataset_size: 1316303506
---
# SALMon Normalized Dataset
This repo preserves the SALMon per-config folder layout while normalizing
mismatched schema details across model families.
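
Because the layout is per-config, each config is loaded by name. The sketch below is a minimal illustration, not the maintainers' loader: the repository ID and config names come from the metadata above, while `load_config`, `has_transition_column`, and `SHARED_NON_AUDIO_COLUMNS` are hypothetical helpers introduced here. It assumes the `datasets` package is installed and that the Hub is reachable when `load_config` is called.

```python
REPO_ID = "SpeechPPL/SALMon_Flow-SLM-1B-Extended-normalized2"

# Config names exactly as listed in the card's `dataset_info` metadata.
CONFIGS = [
    "bg_alignment",
    "bg_all_consistency",
    "bg_domain_consistency",
    "gender_consistency",
    "rir_consistency",
    "sentiment_alignment",
    "sentiment_consistency",
    "speaker_consistency",
]

# Non-audio columns that the metadata declares for every config.
SHARED_NON_AUDIO_COLUMNS = [
    "task",
    "ind",
    "positive_sample_tokenwise_loss",
    "negative_sample_tokenwise_loss",
    "code_frame_rate",
    "code_depth",
    "model_sampling_rate",
    "ppl_sanity",
]


def has_transition_column(config_name: str) -> bool:
    """Per the metadata, only the *_consistency configs declare `audio_transition_s`."""
    return config_name.endswith("_consistency")


def load_config(config_name: str):
    """Load one config's `train` split, keeping only the shared non-audio columns.

    Hypothetical helper: needs `pip install datasets` and network access.
    """
    if config_name not in CONFIGS:
        raise ValueError(f"unknown config: {config_name!r}")
    # Imported lazily so the constants and helpers above work without `datasets`.
    from datasets import load_dataset

    ds = load_dataset(REPO_ID, config_name, split="train")
    # Dropping the audio columns means iterating rows never decodes audio.
    return ds.select_columns(SHARED_NON_AUDIO_COLUMNS)
```

Calling `load_config("sentiment_alignment")` would fetch the smallest config listed in the metadata (about 46 MB); the `*_consistency` configs are each roughly 1.3 GB.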
Provider:
SpeechPPL
Collected summary
Dataset introduction

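Construction
The card's metadata is the only record of how the dataset is put together. It defines eight configs (bg_alignment, bg_all_consistency, bg_domain_consistency, gender_consistency, rir_consistency, sentiment_alignment, sentiment_consistency, speaker_consistency), each with a single train split of 200 examples. Every example pairs a positive_audio clip with a negative_audio clip and adds a prompt, a positive and a negative continuation, a negative sanity clip, and six model-generated continuations, all declared at 16 kHz. Alongside the audio, each row stores per-token losses for the positive and negative samples, codec metadata (code_frame_rate, code_depth, model_sampling_rate), and an integer ppl_sanity field. The six *_consistency configs additionally store audio_transition_s, prompt_sample_tokenwise_loss, and float64 per-token losses for the positive and negative continuations. The repository name suggests that the losses and generated continuations come from a model labelled Flow-SLM-1B-Extended; the card itself says only that the repo keeps the SALMon per-config folder layout and normalizes mismatched schema details across model families.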
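Usage
Load the dataset with the Hugging Face Datasets library by passing one of the eight config names. Each config exposes only a train split of 200 examples, so there is no predefined validation or test partition. The columns prompt_audio, continuation_audio_positive, continuation_audio_negative, negative_audio_sanity, and model_generated_continuation (plus its numbered variants 1 to 5) are declared at 16 kHz. For likelihood-style evaluation, compare positive_sample_tokenwise_loss with negative_sample_tokenwise_loss for each example; the *_consistency configs also provide losses restricted to the prompt and to each continuation.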
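
The `positive_sample_tokenwise_loss` and `negative_sample_tokenwise_loss` columns declared in the schema can be reduced to a pairwise score. The following is a minimal sketch under stated assumptions, not the dataset authors' scoring code: it assumes that a lower loss means the model finds a sample more likely, and it compares the mean per-token loss (length-normalized) rather than the sum, which is a choice made here for illustration. The function names are hypothetical.

```python
from statistics import fmean
from typing import Iterable, Mapping, Sequence


def mean_loss(tokenwise_loss: Sequence[float]) -> float:
    """Average per-token loss of one sample (assumed: lower means more likely)."""
    if not tokenwise_loss:
        raise ValueError("empty loss sequence")
    return fmean(tokenwise_loss)


def prefers_positive(row: Mapping[str, Sequence[float]]) -> bool:
    """True when the positive sample has the strictly lower mean loss."""
    pos = mean_loss(row["positive_sample_tokenwise_loss"])
    neg = mean_loss(row["negative_sample_tokenwise_loss"])
    return pos < neg


def pairwise_accuracy(rows: Iterable[Mapping[str, Sequence[float]]]) -> float:
    """Fraction of rows in which the positive sample is preferred."""
    outcomes = [prefers_positive(row) for row in rows]
    if not outcomes:
        raise ValueError("no rows to score")
    return sum(outcomes) / len(outcomes)


# Toy rows with made-up losses, only to show the expected input shape.
toy_rows = [
    {"positive_sample_tokenwise_loss": [1.0, 2.0],
     "negative_sample_tokenwise_loss": [2.0, 3.0]},
    {"positive_sample_tokenwise_loss": [3.0],
     "negative_sample_tokenwise_loss": [1.0, 1.0]},
]
print(pairwise_accuracy(toy_rows))  # 0.5: first row prefers positive, second does not
```

Rows loaded from any of the eight configs have the same two columns, so the same function applies to each config's 200 examples. Whether mean or summed loss is the appropriate comparison depends on the evaluation protocol, which the card does not specify.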
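Background and challenges
Background overview
SALMon is a benchmark suite for evaluating how well speech language models capture acoustic properties of audio, such as speaker identity, sentiment, background noise, and room acoustics, rather than spoken content alone. Each item contrasts a positive sample with a negative one, and a model is credited when it assigns the positive sample the higher likelihood. This repository, published by SpeechPPL, appears to hold the benchmark items together with the outputs of one evaluated model (Flow-SLM-1B-Extended, judging by the repository name), stored under a schema normalized across model families.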
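Current challenges
The schema points to the main practical difficulties. Model families can differ in codec frame rate, codebook depth, and sampling rate, which is presumably why each row records code_frame_rate, code_depth, and model_sampling_rate; per-token losses may therefore not be directly comparable across models without normalization. Loss precision is mixed: the sample-level and prompt-level losses are float32, while the continuation-level losses are float64. The configs also vary widely in size, from about 46 MB (sentiment_alignment) and 87 MB (bg_alignment) to about 1.3 GB for each *_consistency config, so downloading every config is costly when only the loss columns are needed.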
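Common scenarios
Typical use case
The typical use is pairwise likelihood evaluation: for each of the 200 examples in a config, check whether the model's loss on the positive sample is lower than its loss on the negative sample, then report the fraction of examples for which it is. The model_generated_continuation columns additionally support listening tests or automatic scoring of what the model produces after prompt_audio.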
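Academic problems addressed
The configs target two questions about speech language models. The *_consistency configs (speaker, gender, sentiment, background, and room impulse response) test whether a model notices when an acoustic property changes partway through a recording; the audio_transition_s column presumably marks where that change occurs. The *_alignment configs (sentiment and background) test whether the acoustic signal matches the spoken content. Storing per-token losses rather than a single score per sample makes it possible to see where in the sequence the model's preference arises.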
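Practical applications
In practice the dataset serves as a stored evaluation run. Results can be recomputed from the loss columns without rerunning the model, compared against other models published under the same normalized schema, and audited by listening to the prompt, the reference continuations, and the generated continuations side by side. The ppl_sanity and negative_audio_sanity columns appear intended for sanity checks of the scoring pipeline, although the card does not define them.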
Recent research on the dataset
Latest research directions
The card cites no publications. The structure of the repository suggests two lines of work: perplexity-based comparison of speech language models across model families, which the normalized schema and the SpeechPPL organization name both point to, and analysis of generated continuations, supported by the six model_generated_continuation columns stored for every example.
The content above was collected and summarized by 遇见数据集.



