samsum
Source: ModelScope community (updated 2025-12-10, indexed 2025-09-13)
Download link: https://modelscope.cn/datasets/knkarthick/samsum
# Dataset Card for SAMSum Corpus
## Dataset Description
### Links
- **Homepage:** https://arxiv.org/abs/1911.12237v2
- **Repository:** https://arxiv.org/abs/1911.12237v2
- **Paper:** https://arxiv.org/abs/1911.12237v2
- **Point of Contact:** https://huggingface.co/knkarthick
### Dataset Summary
The SAMSum dataset contains about 16k messenger-like conversations with summaries. The conversations were created and written down by linguists fluent in English, who were asked to write conversations similar to those they write on a daily basis, reflecting the proportion of topics in their real-life messenger conversations. The style and register vary: conversations can be informal, semi-formal or formal, and may contain slang, emoticons and typos. The conversations were then annotated with summaries, each expected to be a concise, third-person brief of what people talked about in the conversation.
The SAMSum dataset was prepared by Samsung R&D Institute Poland and is distributed for research purposes (non-commercial licence: CC BY-NC-ND 4.0).
### Languages
English
## Dataset Structure
### Data Instances
The SAMSum dataset is made up of 16369 conversations distributed uniformly into 4 groups based on the number of utterances in each conversation: 3-6, 7-12, 13-18 and 19-30. Each utterance contains the name of the speaker. Most conversations are dialogues between two interlocutors (about 75% of all conversations); the rest are between three or more people.
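As a rough illustration of the four length groups above, the following sketch maps an utterance count to its group label. The names `LENGTH_GROUPS` and `length_group` are hypothetical helpers introduced here, not part of the dataset or any library.

```python
# Utterance-count ranges of the four groups named in the card (inclusive bounds).
LENGTH_GROUPS = [(3, 6), (7, 12), (13, 18), (19, 30)]


def length_group(n_utterances):
    """Return the 'low-high' label of the group containing n_utterances.

    Returns None when the count falls outside all four ranges.
    """
    for low, high in LENGTH_GROUPS:
        if low <= n_utterances <= high:
            return f"{low}-{high}"
    return None


print(length_group(3))   # a three-utterance conversation falls in the first group
```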
The first instance in the training set:
{'id': '13818513', 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.', 'dialogue': "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}
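The instance above stores the whole conversation in one string, with utterances separated by `\r\n` and each prefixed by the speaker's name. A minimal sketch of splitting it into turns follows; it assumes every utterance sits on its own line in a `Name: text` layout, as in this example, and `parse_dialogue` is a hypothetical helper name.

```python
# The first training instance, copied from the card.
example = {
    "id": "13818513",
    "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
    "dialogue": "Amanda: I baked cookies. Do you want some?\r\n"
                "Jerry: Sure!\r\n"
                "Amanda: I'll bring you tomorrow :-)",
}


def parse_dialogue(dialogue):
    """Split a dialogue string into (speaker, text) pairs.

    Assumes one utterance per line and a 'Name: text' layout.
    Lines without a 'Name: ' prefix are kept with an empty speaker.
    """
    turns = []
    for line in dialogue.splitlines():  # handles both \r\n and \n
        if not line.strip():
            continue
        speaker, sep, text = line.partition(": ")
        if not sep:
            speaker, text = "", line
        turns.append((speaker.strip(), text.strip()))
    return turns


turns = parse_dialogue(example["dialogue"])
speakers = sorted({speaker for speaker, _ in turns})
print(len(turns), speakers)
```

Splitting only on the first `": "` keeps emoticons such as `:-)` inside the utterance text.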
### Data Fields
- dialogue: text of dialogue.
- summary: human written summary of the dialogue.
- id: unique file id of an example.
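A small sanity check for the three fields might look like the sketch below. It assumes all three values are strings, which matches the example instance shown in this card but is not stated as a rule by the source; `REQUIRED_FIELDS` and `validate_record` are hypothetical names.

```python
REQUIRED_FIELDS = ("id", "dialogue", "summary")


def validate_record(record):
    """Return a list of problems found in one record.

    An empty list means the record has all three fields as strings.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field {field} is not a string")
    return problems


print(validate_record({"id": "1", "dialogue": "A: hi", "summary": "A says hi."}))
```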
### Data Splits
- train: 14732
- val: 818
- test: 819
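The split sizes can be checked against the total of 16369 conversations given earlier in the card. The short sketch below uses only the numbers listed above and also prints each split's share of the whole.

```python
# Split sizes as listed in the card.
splits = {"train": 14732, "val": 818, "test": 819}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} ({shares[name]:.1%})")
print(f"total: {total}")
```

The three splits add up to 16369, with roughly 90% of the conversations in the training split and about 5% each in validation and test.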
## Dataset Creation
### Curation Rationale
From the paper:
In the first approach, we reviewed datasets from the following categories: chatbot dialogues, SMS corpora, IRC/chat data, movie dialogues, tweets, comments data (conversations formed by replies to comments), transcription of meetings, written discussions, phone dialogues and daily communication data. Unfortunately, they all differed in some respect from the conversations that are typically written in messenger apps, e.g. they were too technical (IRC data), too long (comments data, transcription of meetings), lacked context (movie dialogues) or they were more of a spoken type, such as a dialogue between a petrol station assistant and a client buying petrol.
As a consequence, we decided to create a chat dialogue dataset by constructing such conversations that would epitomize the style of a messenger app.
### Who are the source language producers?
Linguists
### Who are the annotators?
Language experts
### Annotation process
From the paper:
Each dialogue was created by one person. After collecting all of the conversations, we asked language experts to annotate them with summaries, assuming that they should (1) be rather short, (2) extract important pieces of information, (3) include names of interlocutors, (4) be written in the third person. Each dialogue contains only one reference summary.
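Two of the four summary guidelines lend themselves to a simple automated spot check: that the summary is short (1) and that it names the interlocutors (3). The sketch below is only a heuristic written for illustration, not the annotation procedure used by the authors. The paper excerpt gives no numeric length limit, so "shorter than the dialogue" is an assumption, and `check_summary` is a hypothetical helper. Guidelines (2) and (4) are not checked.

```python
def check_summary(summary, dialogue):
    """Heuristic check of a summary against guidelines (1) and (3).

    Assumes one 'Name: text' utterance per line in the dialogue.
    """
    speakers = {
        line.partition(": ")[0].strip()
        for line in dialogue.splitlines()
        if ": " in line
    }
    return {
        # Assumption: "rather short" is read as shorter than the dialogue itself.
        "shorter_than_dialogue": len(summary) < len(dialogue),
        # Speaker names that never appear in the summary text.
        "missing_names": sorted(name for name in speakers if name not in summary),
    }


result = check_summary(
    "Amanda baked cookies and will bring Jerry some tomorrow.",
    "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)",
)
print(result)
```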
## Licensing Information
non-commercial licence: CC BY-NC-ND 4.0
## Citation Information
```
@inproceedings{gliwa-etal-2019-samsum,
title = "{SAMS}um Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization",
author = "Gliwa, Bogdan and
Mochol, Iwona and
Biesek, Maciej and
Wawer, Aleksander",
booktitle = "Proceedings of the 2nd Workshop on New Frontiers in Summarization",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-5409",
doi = "10.18653/v1/D19-5409",
pages = "70--79"
}
```
## Contributions
Provider: maas
Created: 2025-09-04



