bloomberg/entsum
Dataset Overview
Dataset Name
EntSUM: A Data Set for Entity-Centric Extractive Summarization
Authors
Mounica Maddela*, Mayank Kulkarni*, Daniel Preotiuc-Pietro
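Description
EntSUM is a controllable summarization data set focused on entities as the aspect to control. Rather than producing a single generic summary of a document, as in the traditional setup, it targets summaries generated according to user-specified entity aspects and preferences. The data set consists of three JSON files, each corresponding to a different type of summary annotation. Each file contains the document ID, the sentence IDs, the summary, the salient sentences, and the sentence IDs that correspond to the summary. The source text can be obtained by downloading the NYT corpus and mapping the document IDs to it.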
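The mapping from an annotation record back to the source text can be sketched as follows. This is a minimal illustration, not the data set's documented schema: the key names `doc_id` and `summary_sentence_ids`, the assumption that each file is a JSON list of records, and the 0-based sentence indexing are all hypothetical and should be checked against the actual JSON files. The NYT sentences must come from a separately licensed copy of the corpus.

```python
import json

# Hypothetical field names; check the actual JSON files for the real keys.
DOC_ID_KEY = "doc_id"
SUMMARY_SENT_IDS_KEY = "summary_sentence_ids"


def load_annotations(path):
    """Load one annotation file, assumed here to be a JSON list of records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def extract_summary(record, nyt_sentences_by_doc):
    """Return the source sentences that a record's summary points to.

    nyt_sentences_by_doc maps a document ID to that document's list of
    sentences, built separately from a licensed copy of the NYT corpus.
    Sentence IDs are assumed to be 0-based positions in that list.
    """
    sentences = nyt_sentences_by_doc[record[DOC_ID_KEY]]
    return [sentences[i] for i in record[SUMMARY_SENT_IDS_KEY]]


# Toy data standing in for a real record and a real NYT article.
record = {"doc_id": "doc-1", "summary_sentence_ids": [0, 2]}
corpus = {"doc-1": ["A met B.", "It rained.", "A later left."]}
summary = extract_summary(record, corpus)
print(summary)
```

Keeping the key names in module-level constants means only two lines need to change once the real field names are known.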
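Language
English
Keywords
Natural language processing, summarization, abstractive summarization, extractive summarization
Related Identifiers
The data set is derived from the NYT corpus; the relevant license information is available on the LDC website.
Citation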
@inproceedings{maddela-etal-2022-entsum,
    title = "{E}nt{SUM}: A Data Set for Entity-Centric Extractive Summarization",
    author = "Maddela, Mounica and Kulkarni, Mayank and Preotiuc-Pietro, Daniel",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.237",
    pages = "3355--3366",
    abstract = "Controllable summarization aims to provide summaries that take into account user-specified aspects and preferences to better assist them with their information need, as opposed to the standard summarization setup which build a single generic summary of a document. We introduce a human-annotated data set EntSUM for controllable summarization with a focus on named entities as the aspects to control. We conduct an extensive quantitative analysis to motivate the task of entity-centric summarization and show that existing methods for controllable summarization fail to generate entity-centric summaries. We propose extensions to state-of-the-art summarization approaches that achieve substantially better results on our data set. Our analysis and results show the challenging nature of this task and of the proposed data set.",
}



