neil-code/dialogsum-test
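DIALOGSum Corpus Dataset Overview
Dataset Description
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset consisting of 13,460 dialogues, each paired with a manually annotated summary and topic. An additional 100 holdout dialogues are provided for topic generation.
Languages
English
Dataset Structure
Data Instances
The DialogSum dataset contains 13,460 dialogues (plus 1,000 additional test dialogues), split into training, test, and validation sets.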
The first instance in the training set:

```json
{
  "id": "train_0",
  "summary": "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking.",
  "dialogue": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today? #Person2#: I found it would be a good idea to get a check-up. #Person1#: Yes, well, you haven't had one for 5 years. You should have one every year. #Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor? #Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good. #Person2#: Ok. #Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith? #Person2#: Yes. #Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit. #Person2#: I've tried hundreds of times, but I just can't seem to kick the habit. #Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave. #Person2#: Ok, thanks doctor.",
  "topic": "get a check-up"
}
```
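Data Fields
- dialogue: the text of the dialogue.
- summary: a human-written summary of the dialogue.
- topic: a human-written topic (a one-line gist) of the dialogue.
- id: the unique file ID of the example.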
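As a minimal sketch of how a record with these fields might be handled, the Python below checks that the four fields are present and splits a dialogue into speaker turns. The `#PersonN#:` tag pattern is inferred from the single training example shown above, and the helper names `missing_fields` and `split_turns` are hypothetical, not part of any DialogSum tooling.

```python
import re

# Speaker tags in the sample record look like "#Person1#:". This pattern is
# inferred from that one example and is an assumption about the whole corpus.
SPEAKER_TAG = re.compile(r"(#Person\d+#):\s*")

REQUIRED_FIELDS = ("id", "dialogue", "summary", "topic")


def missing_fields(record):
    """Return the required field names that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def split_turns(dialogue):
    """Split a dialogue string into (speaker_tag, utterance) pairs."""
    parts = SPEAKER_TAG.split(dialogue)
    # With one capture group, re.split returns
    # [text_before_first_tag, tag1, utterance1, tag2, utterance2, ...].
    tags, utterances = parts[1::2], parts[2::2]
    return [(tag, text.strip()) for tag, text in zip(tags, utterances)]


# A record shortened for illustration; only the field names follow the card.
record = {
    "id": "train_0",
    "dialogue": "#Person1#: Do you smoke, Mr. Smith? #Person2#: Yes.",
    "summary": "Mr. Smith's getting a check-up.",
    "topic": "get a check-up",
}

for speaker, utterance in split_turns(record["dialogue"]):
    print(speaker, "->", utterance)
```

A record without a `summary` key would be reported by `missing_fields` as `["summary"]`, which gives a simple way to tell summary-bearing records apart from those that carry only `id`, `dialogue`, and `topic`.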
Data Splits
- Training set: 12,460
- Validation set: 500
- Test set: 1,500
- Holdout set: 100 (contains only the three fields id, dialogue, and topic)
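The split sizes can be reconciled with the headline figure by simple arithmetic. The sketch below assumes one reading of the numbers: the 13,460 dialogues plus the 1,000 additional test dialogues together make up the training, validation, and test sets, with the holdout set counted separately. The variable names are illustrative only.

```python
# Split sizes as listed in the card.
SPLIT_SIZES = {"train": 12460, "validation": 500, "test": 1500, "holdout": 100}

# The holdout set has no summaries, so it is left out of the summarization total.
summarization_total = sum(
    size for name, size in SPLIT_SIZES.items() if name != "holdout"
)

# Assumed reading: 13,460 dialogues plus 1,000 additional test dialogues.
headline_total = 13460 + 1000

print(summarization_total, headline_total)

# Share of each summarization split, as a fraction of the summarization total.
for name in ("train", "validation", "test"):
    print(name, round(SPLIT_SIZES[name] / summarization_total, 4))
```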
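Dataset Creation
Curation Rationale
DialogSum collects dialogue data from three public dialogue corpora (DailyDialog, DREAM, and MuTual) and from an English speaking practice website. These sources contain face-to-face spoken dialogues covering everyday topics such as school, work, medical care, shopping, leisure, and travel. Most conversations take place between friends, between colleagues, or between service providers and customers.
Compared with previous datasets, dialogues in DialogSum have the following characteristics:
- They are set in rich real-life scenarios, including more diverse task-oriented scenarios;
- They have clear communication patterns and intents, which makes them suitable sources for summarization;
- They are of reasonable length, which suits the purpose of automatic summarization.
Annotators summarized each dialogue according to the following criteria:
- Convey the most salient information;
- Be brief;
- Preserve important named entities from the dialogue;
- Be written from an observer's perspective;
- Be written in formal language.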
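Two of the criteria above, brevity and the observer perspective, lend themselves to rough automatic sanity checks. The sketch below is only a heuristic and not the annotators' procedure: it assumes the `#PersonN#:` speaker-tag format seen in the training example, uses a small hand-picked pronoun list, and the function names are hypothetical.

```python
import re

# A small, hand-picked list; an assumption, not an exhaustive inventory.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}

# Speaker tags such as "#Person1#:" (format inferred from the sample record).
SPEAKER_TAG = re.compile(r"#Person\d+#:")


def words(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def compression_ratio(dialogue, summary):
    """Summary length divided by dialogue length, in tokens (tags excluded)."""
    dialogue_words = words(SPEAKER_TAG.sub(" ", dialogue))
    if not dialogue_words:
        raise ValueError("dialogue contains no words")
    return len(words(summary)) / len(dialogue_words)


def uses_first_person(summary):
    """Flag summaries that may not be written from an observer's perspective."""
    return any(token in FIRST_PERSON for token in words(summary))


dialogue = "#Person1#: Do you smoke, Mr. Smith? #Person2#: Yes."
summary = "Mr. Smith smokes."

print(compression_ratio(dialogue, summary))
print(uses_first_person(summary))
```

A low ratio suggests a brief summary, and a first-person hit is only a prompt for manual review, since quoted speech inside a summary would also trigger it.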
Source Language Producers
Linguists
Annotators
Language experts
Licensing Information
MIT License



