SpeechPPL/SALMon_Llama-Mimi1.3B-normalized2
Source: Hugging Face. Updated 2026-04-10; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/SpeechPPL/SALMon_Llama-Mimi1.3B-normalized2
Resource overview:
---
dataset_info:
- config_name: bg_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 24000
  splits:
  - name: train
    num_bytes: 86750711
    num_examples: 200
  download_size: 86750711
  dataset_size: 86750711
- config_name: bg_all_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 24000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 234883788
    num_examples: 200
  download_size: 234883788
  dataset_size: 234883788
- config_name: bg_domain_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 24000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 237747052
    num_examples: 200
  download_size: 237747052
  dataset_size: 237747052
- config_name: gender_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 24000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 238663168
    num_examples: 200
  download_size: 238663168
  dataset_size: 238663168
- config_name: rir_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 24000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 218836796
    num_examples: 200
  download_size: 218836796
  dataset_size: 218836796
- config_name: sentiment_alignment
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    sequence: int64
  - name: ppl_sanity
    dtype: int64
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 24000
  splits:
  - name: train
    num_bytes: 46529917
    num_examples: 200
  download_size: 46529917
  dataset_size: 46529917
- config_name: sentiment_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 24000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 232197295
    num_examples: 200
  download_size: 232197295
  dataset_size: 232197295
- config_name: speaker_consistency
  features:
  - name: task
    dtype: string
  - name: ind
    dtype: int64
  - name: positive_audio
    dtype: audio
  - name: negative_audio
    dtype: audio
  - name: audio_transition_s
    dtype: int64
  - name: prompt_audio
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_positive
    dtype:
      audio:
        sampling_rate: 16000
  - name: continuation_audio_negative
    dtype:
      audio:
        sampling_rate: 16000
  - name: negative_audio_sanity
    dtype:
      audio:
        sampling_rate: 16000
  - name: positive_sample_tokenwise_loss
    sequence: float32
  - name: negative_sample_tokenwise_loss
    sequence: float32
  - name: positive_continuation_tokenwise_loss
    sequence: float32
  - name: negative_continuation_tokenwise_loss
    sequence: float32
  - name: prompt_sample_tokenwise_loss
    sequence: float32
  - name: model_generated_continuation
    dtype:
      audio:
        sampling_rate: 24000
  - name: code_frame_rate
    dtype: int64
  - name: code_depth
    dtype: int64
  - name: model_sampling_rate
    dtype: int64
  - name: ppl_sanity
    dtype: int64
  splits:
  - name: train
    num_bytes: 239774488
    num_examples: 200
  download_size: 239774488
  dataset_size: 239774488
---
# SALMon Normalized Dataset
This repo preserves the SALMon per-config folder layout while normalizing
mismatched schema details across model families.
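
Because the per-config layout is preserved, each configuration can be loaded by name. The snippet below is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that the repository ID matches the path in the download link; `load_config` is a helper name introduced here for illustration only.

```python
# Repository ID taken from the download link; config names from dataset_info.
REPO_ID = "SpeechPPL/SALMon_Llama-Mimi1.3B-normalized2"
CONFIGS = [
    "bg_alignment",
    "bg_all_consistency",
    "bg_domain_consistency",
    "gender_consistency",
    "rir_consistency",
    "sentiment_alignment",
    "sentiment_consistency",
    "speaker_consistency",
]


def load_config(name, split="train"):
    """Load one configuration (hypothetical helper; downloads from the Hub)."""
    if name not in CONFIGS:
        raise ValueError(f"unknown config: {name!r}")
    # Imported lazily so the constants above work without the dependency.
    from datasets import load_dataset

    return load_dataset(REPO_ID, name, split=split)


if __name__ == "__main__":
    ds = load_config("bg_alignment")  # needs network access
    print(ds.features)
```

Each config declares a single `train` split of 200 examples, so the default `split="train"` covers the whole configuration.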
The dataset information (dataset_info) comprises the following 8 configurations:
1. Config name (config_name): bg_alignment
Features:
- task: task name, string
- ind: index, int64
- positive_audio: positive-sample audio, audio
- negative_audio: negative-sample audio, audio
- prompt_audio: prompt audio, audio with a sampling rate (sampling_rate) of 16000 Hz
- continuation_audio_positive: positive continuation audio, audio at 16000 Hz
- continuation_audio_negative: negative continuation audio, audio at 16000 Hz
- negative_audio_sanity: negative audio used for sanity checking, audio at 16000 Hz
- positive_sample_tokenwise_loss: per-token loss of the positive sample, float32 sequence
- negative_sample_tokenwise_loss: per-token loss of the negative sample, float32 sequence
- code_frame_rate: code frame rate, int64
- code_depth: code depth, int64
- model_sampling_rate: model sampling rate, int64 sequence
- ppl_sanity: perplexity sanity-check value, int64
- model_generated_continuation: continuation audio generated by the model, audio at 24000 Hz
Splits:
- train: 86750711 bytes, 200 examples
Download size: 86750711 bytes; dataset size: 86750711 bytes
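2. Config name: bg_all_consistency
Compared with bg_alignment, the feature list adds the following fields:
- audio_transition_s: audio transition time in seconds, int64
- positive_continuation_tokenwise_loss: per-token loss of the positive continuation, float32 sequence
- negative_continuation_tokenwise_loss: per-token loss of the negative continuation, float32 sequence
- prompt_sample_tokenwise_loss: per-token loss of the prompt, float32 sequence
In addition, model_sampling_rate is a scalar int64 (not a sequence); all other features match bg_alignment.
Splits: train has 234883788 bytes and 200 examples; download size 234883788 bytes, dataset size 234883788 bytes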
3. Config name: bg_domain_consistency
Features are identical to bg_all_consistency. Splits: train has 237747052 bytes and 200 examples; download size 237747052 bytes, dataset size 237747052 bytes
4. Config name: gender_consistency
Features are identical to bg_all_consistency. Splits: train has 238663168 bytes and 200 examples; download size 238663168 bytes, dataset size 238663168 bytes
5. Config name: rir_consistency
Features are identical to bg_all_consistency. Splits: train has 218836796 bytes and 200 examples; download size 218836796 bytes, dataset size 218836796 bytes
Note: rir stands for Room Impulse Response.
6. Config name: sentiment_alignment
Features are identical to bg_alignment. Splits: train has 46529917 bytes and 200 examples; download size 46529917 bytes, dataset size 46529917 bytes
7. Config name: sentiment_consistency
Features are identical to bg_all_consistency. Splits: train has 232197295 bytes and 200 examples; download size 232197295 bytes, dataset size 232197295 bytes
8. Config name: speaker_consistency
Features are identical to bg_all_consistency. Splits: train has 239774488 bytes and 200 examples; download size 239774488 bytes, dataset size 239774488 bytes
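
The schema declares `model_sampling_rate` as an int64 sequence in the two alignment configs and as a scalar int64 in the six consistency configs, so code that reads several configs has to accept both shapes. The following is a minimal sketch of one way to do that; `scalar_sampling_rate` is a hypothetical helper, the example values are made up, and the assumption that a sequence holds one repeated value is not confirmed by the card.

```python
from collections.abc import Sequence


def scalar_sampling_rate(value):
    """Return a single int whether the field is a scalar or a sequence.

    Assumption: when the field is a sequence, all entries are equal.
    A sequence with differing (or no) entries raises ValueError.
    """
    if isinstance(value, int):
        return value
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        unique = set(value)
        if len(unique) == 1:
            return int(unique.pop())
        raise ValueError(f"expected one distinct rate, got {sorted(unique)}")
    raise TypeError(f"unsupported type: {type(value).__name__}")


# Illustrative values only; real rows come from the dataset.
print(scalar_sampling_rate(24000))
print(scalar_sampling_rate([24000, 24000]))
```

Raising on mixed values, instead of silently picking the first entry, keeps a schema surprise visible to the caller.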
Provider:
SpeechPPL
Collected summary
Dataset introduction

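Construction
SALMon_Llama-Mimi1.3B-normalized2 is a speech dataset. Judging from its name and schema, it pairs SALMon benchmark samples with outputs recorded for the Llama-Mimi 1.3B speech language model. Each of the 8 configurations holds 200 examples in a single train split, 1,600 examples in total. Every example contains a positive and a negative audio sample, a 16 kHz prompt with positive and negative continuations, a sanity-check negative, per-token losses for the positive and negative samples, and a model-generated continuation at 24 kHz. According to the card, the normalization concerns mismatched schema details across model families while the SALMon per-config folder layout is kept; the card describes no score normalization, manual verification, or filtering step.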
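Characteristics
The dataset is organized by task. Two alignment configurations (bg_alignment, sentiment_alignment) share one feature set, and six consistency configurations (bg_all_consistency, bg_domain_consistency, gender_consistency, rir_consistency, sentiment_consistency, speaker_consistency) add audio_transition_s plus per-token losses for the prompt and for both continuations. Losses are stored per token as float32 sequences, not as a single label, so sequence-level scores can be recomputed and the positions where the positive and negative samples diverge can be inspected. The fields code_frame_rate, code_depth, and model_sampling_rate record tokenization settings, and ppl_sanity holds a sanity-check value.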
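Usage
The stored losses support a pairwise comparison. For each example, aggregate positive_sample_tokenwise_loss and negative_sample_tokenwise_loss (for instance by their mean) and check whether the positive sample receives the lower loss; the fraction of examples for which it does is a per-config accuracy. The card does not state which aggregation is intended. In the consistency configurations, positive_continuation_tokenwise_loss and negative_continuation_tokenwise_loss allow the same comparison restricted to the continuation. Configurations are selected by name when loading the dataset.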
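
One way to turn the stored token-wise losses into a per-config score is a pairwise comparison: an example counts as correct when the positive sample gets a lower aggregate loss than the negative one. The sketch below assumes mean aggregation over tokens, which the card does not specify, and uses made-up loss values; `mean_loss` and `pairwise_accuracy` are illustrative names, not part of any library.

```python
def mean_loss(token_losses):
    """Average per-token loss of one sample; rejects empty input."""
    token_losses = list(token_losses)
    if not token_losses:
        raise ValueError("empty loss sequence")
    return sum(token_losses) / len(token_losses)


def pairwise_accuracy(rows):
    """Fraction of rows whose positive sample has the lower mean loss."""
    rows = list(rows)
    if not rows:
        raise ValueError("no rows")
    correct = sum(
        mean_loss(row["positive_sample_tokenwise_loss"])
        < mean_loss(row["negative_sample_tokenwise_loss"])
        for row in rows
    )
    return correct / len(rows)


# Toy rows with invented numbers, shaped like the dataset's loss fields.
toy_rows = [
    {"positive_sample_tokenwise_loss": [1.0, 2.0],
     "negative_sample_tokenwise_loss": [2.0, 3.0]},
    {"positive_sample_tokenwise_loss": [4.0, 4.0],
     "negative_sample_tokenwise_loss": [1.0, 2.0]},
]
print(pairwise_accuracy(toy_rows))  # 0.5
```

With 200 examples per config, each example moves this accuracy by 0.5 percentage points, so small differences between models should be read with care.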
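Background and challenges
Background overview
SALMon is a benchmark suite for evaluating acoustic language models on properties beyond the spoken words, such as background noise, room acoustics, speaker identity, gender, and sentiment; these map directly onto the configuration names here. Llama-Mimi, as its name indicates, is a speech language model that combines a Llama backbone with the Mimi audio codec. This repository, published by SpeechPPL, stores the SALMon samples together with token-wise losses and generated continuations from the 1.3B model, in a schema normalized across model families.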
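Current challenges
Two practical challenges follow from the schema. First, the fields are not uniform across configurations: model_sampling_rate is an int64 sequence in the alignment configurations and a scalar int64 in the consistency ones, and the alignment configurations lack audio_transition_s and the prompt and continuation losses, so consumers must handle both shapes. Second, audio is stored at mixed sampling rates: the prompt and reference continuations are at 16 kHz, the model-generated continuation is at 24 kHz, and positive_audio and negative_audio declare no fixed rate, so resampling may be needed before comparison. In addition, each configuration has only 200 examples, which limits the resolution of any per-config accuracy to 0.5 percentage points.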
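Common scenarios
Typical use cases
The typical use is likelihood-based evaluation of a speech language model on SALMon-style pairs. As the positive and negative field names suggest, the model is expected to assign the positive sample a lower loss than the negative one. Because the per-token losses for Llama-Mimi 1.3B are stored, that comparison can be reproduced or re-aggregated without rerunning the model, and the stored generated continuations can be compared with the reference continuations.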
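Academic problems addressed
The dataset targets the question of whether a speech language model captures acoustic properties beyond the spoken words: background noise, room impulse response, speaker identity, gender, and sentiment. The consistency configurations test whether such a property stays stable within a recording, and the alignment configurations test whether acoustics match the content. Keeping one normalized schema across model families makes it simpler to compare different models on the same 200 examples per task.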
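Related and derived work
The card lists no derived work. The directly related resources are the SALMon benchmark, whose per-config layout this repository preserves, and the Llama-Mimi 1.3B model named in the repository title. The phrase "across model families" suggests that comparable repositories exist for other models, but the card does not name them.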
The content above was collected and summarized by 遇见数据集.



