neil-code/dialogsum-test

Name: neil-code/dialogsum-test
Creator: neil-code
Published: 2023-08-24 03:47:07
License: no description provided

Source: Hugging Face. Updated 2023-08-24; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/neil-code/dialogsum-test


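Overview:

DIALOGSum is a large-scale dialogue summarization dataset containing 13,460 dialogues (plus 100 holdout dialogues for topic generation), each with a human-annotated summary and topic. The dialogues come from several public dialogue corpora and cover a range of everyday scenarios such as school, work, medical care, shopping, leisure, and travel. Most take place between friends, between colleagues, or between service providers and customers. Compared with earlier dialogue datasets, DIALOGSum dialogues draw on richer real-life scenarios, show clearer communication patterns and intents, and have a reasonable length for automatic summarization. Language experts annotated the summary of each dialogue against specific criteria: convey the most important information, be concise, preserve important named entities from the dialogue, write from an observer's perspective, and use formal language.

Provider:

neil-code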

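Summary of Original Information

DIALOGSum Corpus Dataset Overview

Dataset Description

Dataset Summary

DialogSum is a large-scale dialogue summarization dataset containing 13,460 dialogues with corresponding manually annotated summaries and topics. A further 100 holdout dialogues are provided for topic generation.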

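Languages

English

Dataset Structure

Data Instances

DialogSum contains 13,460 dialogues (plus 1,000 additional test dialogues), split into training, test, and validation sets.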

The first instance in the training set, as JSON:

{
  "id": "train_0",
  "summary": "Mr. Smith's getting a check-up, and Doctor Hawkins advises him to have one every year. Hawkins'll give some information about their classes and medications to help Mr. Smith quit smoking.",
  "dialogue": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today? #Person2#: I found it would be a good idea to get a check-up. #Person1#: Yes, well, you haven't had one for 5 years. You should have one every year. #Person2#: I know. I figure as long as there is nothing wrong, why go see the doctor? #Person1#: Well, the best way to avoid serious illnesses is to find out about them early. So try to come at least once a year for your own good. #Person2#: Ok. #Person1#: Let me see here. Your eyes and ears look fine. Take a deep breath, please. Do you smoke, Mr. Smith? #Person2#: Yes. #Person1#: Smoking is the leading cause of lung cancer and heart disease, you know. You really should quit. #Person2#: I've tried hundreds of times, but I just can't seem to kick the habit. #Person1#: Well, we have classes and some medications that might help. I'll give you more information before you leave. #Person2#: Ok, thanks doctor.",
  "topic": "get a check-up"
}
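The dialogue field in the instance above stores every turn in one string, with each turn introduced by a speaker tag such as #Person1#:. The following is a minimal sketch of how such a string could be split into (speaker, utterance) pairs. The tag pattern is an assumption inferred from this single record, and `split_turns` is a hypothetical helper, not part of any official DialogSum tooling.

```python
import re
from typing import List, Tuple

# Assumption: speaker tags always look like "#Person<number>#:" as in the sample record.
SPEAKER_TAG = re.compile(r"(#Person\d+#):\s*")


def split_turns(dialogue: str) -> List[Tuple[str, str]]:
    """Split a DialogSum-style dialogue string into (speaker, utterance) pairs."""
    # re.split with one capturing group yields:
    # [text before first tag, tag, utterance, tag, utterance, ...]
    parts = SPEAKER_TAG.split(dialogue)
    speakers = parts[1::2]
    utterances = parts[2::2]
    # strip() removes the space or newline that separates consecutive turns
    return [(s, u.strip()) for s, u in zip(speakers, utterances)]


sample = (
    "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today? "
    "#Person2#: I found it would be a good idea to get a check-up."
)
turns = split_turns(sample)
for speaker, utterance in turns:
    print(speaker, "->", utterance)
```

Because the utterances are stripped, the same helper should work whether the turns are separated by spaces, as on this page, or by newlines.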

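Data Fields

dialogue: the text of the dialogue.
summary: a human-written summary of the dialogue.
topic: a human-written topic or one-line description of the dialogue.
id: the unique file ID of the example.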
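As an illustration of the four fields listed above, the sketch below parses one JSON record into a typed object and rejects records with missing fields. `DialogSumRecord` and `parse_record` are hypothetical names chosen for this example, and treating every field as a string is an assumption based on the sample instance.

```python
import json
from dataclasses import dataclass

# The four field names documented for the train, validation, and test records.
FIELD_NAMES = ("id", "dialogue", "summary", "topic")


@dataclass(frozen=True)
class DialogSumRecord:
    id: str
    dialogue: str
    summary: str
    topic: str


def parse_record(line: str) -> DialogSumRecord:
    """Parse one JSON object and check that all four fields are present."""
    raw = json.loads(line)
    missing = [name for name in FIELD_NAMES if name not in raw]
    if missing:
        raise ValueError("record is missing fields: " + ", ".join(missing))
    return DialogSumRecord(**{name: raw[name] for name in FIELD_NAMES})


record = parse_record(
    '{"id": "train_0", "summary": "A short summary.", '
    '"dialogue": "#Person1#: Hello. #Person2#: Hi.", "topic": "get a check-up"}'
)
print(record.id, "|", record.topic)
```

A frozen dataclass is used here so that a parsed record cannot be modified by accident; a plain dict would work equally well for quick exploration.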

Data Splits

Training set: 12,460
Validation set: 500
Test set: 1,500
Holdout set: 100 (contains only the three fields id, dialogue, and topic)
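The split sizes above can be checked against the totals stated earlier in the card: training, validation, and test together should equal the 13,460 dialogues plus the 1,000 additional test dialogues. The sketch below does that arithmetic and records which fields each split carries. The dictionary keys and the `fields_for_split` helper are illustrative names, not identifiers confirmed by the dataset itself.

```python
# Split sizes as listed in the card; the key names are illustrative.
SPLIT_SIZES = {"train": 12460, "validation": 500, "test": 1500, "holdout": 100}

ALL_FIELDS = ("id", "dialogue", "summary", "topic")
HOLDOUT_FIELDS = ("id", "dialogue", "topic")  # the holdout set has no summary field


def fields_for_split(split: str) -> tuple:
    """Return the field names documented for a given split."""
    if split not in SPLIT_SIZES:
        raise KeyError("unknown split: " + split)
    return HOLDOUT_FIELDS if split == "holdout" else ALL_FIELDS


# Splits that carry summaries: train + validation + test.
summarized_total = sum(n for name, n in SPLIT_SIZES.items() if name != "holdout")

# The card states 13,460 dialogues plus 1,000 additional test dialogues.
stated_total = 13460 + 1000

print("train + validation + test =", summarized_total)
print("matches stated total:", summarized_total == stated_total)
```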

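Dataset Creation

Curation Rationale

DialogSum collects dialogue data from three public dialogue corpora (DailyDialog, DREAM, and MuTual) and from an English speaking practice website. These sources contain face-to-face spoken dialogues that cover everyday topics such as school, work, medical care, shopping, leisure, and travel. Most dialogues take place between friends, between colleagues, or between service providers and customers.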

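Compared with previous datasets, dialogues in DialogSum have the following characteristics:

They are set in rich real-life scenarios, including more diverse task-oriented scenarios;
They have clear communication patterns and intents, which makes them suitable as sources for summarization;
They have a reasonable length, which suits the purpose of automatic summarization.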

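Annotators summarized each dialogue according to the following criteria:

Convey the most important information;
Be concise;
Preserve important named entities from the dialogue;
Write from an observer's perspective;
Use formal language.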

Source Language Producers

Linguists

Annotators

Language experts

Licensing Information

MIT License
