SamSum-Pref
Source: ModelScope community. Updated 2025-12-03. Indexed 2025-11-22.
Download link:
https://modelscope.cn/datasets/dada122/SamSum-Pref
Description:
# SamSum-Pref Dataset
SamSum-Pref is a preference-aligned dialogue summarization dataset. It is constructed by sampling summaries from **dadastory/SummOrchestra-Qwen3-8B-GRPO-BRL-SAMSUM** and filtering them with **DeepSeek-V3** as the evaluator. Preference scoring follows the **AnythingReward** evaluation paradigm, adapted to a strict rubric for dialogue-summary quality.
## Evaluation Principles
Each sampled summary is scored according to the following weighted criteria:
1. **Key Information Coverage (40%)**
   - Captures core elements: request/proposal, refusal, insistence, and implied motivation.
   - Missing any major element is a critical error.
2. **Inference & Implicit Understanding (30%)**
   - Correctly reflects implied attitudes or emotional tone.
   - Encourages reasonable inference; penalizes fabrication.
3. **Faithfulness & Precision (20%)**
   - No hallucinations; meaning preserved.
   - Summary must remain strictly grounded in the dialogue.
4. **Conciseness & Clarity (10%)**
   - Brief, well-structured, readable.
   - Verbosity lowers the score.
**Conflict resolution priority:**
Key coverage **>** Faithfulness **>** Inference **>** Clarity.
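The weights and the priority order can be written down as a small sketch. The source does not say how DeepSeek-V3 combines the criteria internally, so the per-criterion 1-5 scale, the function names `weighted_score` and `prefer`, and the lexicographic reading of the priority order are all assumptions made for illustration.

```python
# Weights taken from the rubric above; everything else is illustrative.
WEIGHTS = {
    "key_coverage": 0.40,
    "inference": 0.30,
    "faithfulness": 0.20,
    "clarity": 0.10,
}

# Conflict-resolution order from the rubric, highest priority first.
PRIORITY = ["key_coverage", "faithfulness", "inference", "clarity"]


def weighted_score(scores):
    """Combine per-criterion scores (assumed 1-5 each) into one number."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def prefer(a, b):
    """Return the preferred of two per-criterion score dicts.

    Assumed reading of the priority rule: compare criteria in PRIORITY
    order and let the first difference decide. Returns `a` on a full tie.
    """
    for name in PRIORITY:
        if a[name] != b[name]:
            return a if a[name] > b[name] else b
    return a
```

Because inference carries more weight than faithfulness but ranks below it in the priority order, the two functions can disagree on a pair of candidates. Which of the two the actual pipeline applies is not stated in the source.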
## Sampling & Filtering
- Ten samples are randomly drawn per batch from the base model.
- DeepSeek-V3 provides a 1–5 preference score using the above rubric.
- Only summaries that receive a **score of 5** and are judged **better than the original SamSum summary** in faithfulness and human preference alignment are retained.
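The retention rule amounts to a conjunction of two conditions. The following is a minimal sketch of that rule only. The candidate dict keys (`score`, `better_than_reference`) and the function names are hypothetical, and the calls to the base model and to DeepSeek-V3 are left out because the source does not describe them.

```python
def keep_sample(score, better_than_reference):
    """Retention rule as described: top score AND better than the SamSum reference."""
    return score == 5 and better_than_reference


def filter_batch(candidates):
    """Return the candidates in one batch that pass the retention rule.

    Each candidate is assumed to be a dict with the hypothetical keys
    "summary" (str), "score" (int, 1-5) and "better_than_reference" (bool).
    """
    return [
        c for c in candidates
        if keep_sample(c["score"], c["better_than_reference"])
    ]
```

A batch in which no candidate meets both conditions yields an empty list under this sketch, so such a dialogue would contribute no entry.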
## Data Format
Each accepted entry is stored as a dictionary:
```python
{
    "system_prompt": system_prompt,    # system prompt text
    "instruction": instruction,        # generation instruction
    "reason_content": reason_content,  # rationale text
    "summary": summary,                # generated summary
}
```
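A consumer of the dataset may want to check that each entry carries the four fields before using it. The sketch below assumes every value is a string, which the source does not state explicitly, and `validate_entry` is a hypothetical helper, not part of any released tooling.

```python
REQUIRED_FIELDS = ("system_prompt", "instruction", "reason_content", "summary")


def validate_entry(entry):
    """Return a list of problems found in one entry (empty list if none).

    Assumption: all four fields are expected to hold strings.
    """
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif not isinstance(entry[name], str):
            problems.append(f"field is not a string: {name}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a malformed entry at once.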
## Purpose
SamSum-Pref provides a high-quality, preference-filtered benchmark for training and evaluating dialogue summarization models with strong grounding, human-like judgment, and improved alignment over the original SamSum dataset.
Provider:
maas
Created:
2025-11-17



