google-research-datasets/xsum_factuality

Name: google-research-datasets/xsum_factuality
Creator: google-research-datasets
Published: 2024-01-18 11:18:47
License: no description provided

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/google-research-datasets/xsum_factuality

Resource overview:

The XSum Hallucination Annotations dataset is a large-scale, human-annotated dataset for evaluating the faithfulness and factuality of neural abstractive summarization models. It contains crowdsourced faithfulness and factuality annotations for summaries of documents from the XSum dataset, with 3 annotations for each of the 500 document-system pairs. The dataset is divided into two subsets, faithfulness annotations and factuality annotations, each with its own data field descriptions and split information. The faithfulness subset records the document ID, the summarization system, the generated summary, the hallucination type, the start and end positions of the hallucinated span, and the annotator ID. The factuality subset records the document ID, the summarization system, the generated summary, whether the summary is factual, and the annotator ID.

Provider:

google-research-datasets

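Original information summary

Dataset overview

Dataset description

Dataset summary

The dataset contains a large-scale human evaluation of several neural abstractive summarization systems, intended to better understand the types of hallucinations they produce. It includes faithfulness and factuality annotations of abstractive summaries for the XSum dataset. Each document-system pair has 3 crowdsourced annotations.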

Supported tasks and leaderboards

summarization: used to train summarization models; success is typically measured by ROUGE scores.
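To make the ROUGE reference concrete, here is a minimal sketch of ROUGE-1 F1 (unigram overlap). It is a simplification under stated assumptions: whitespace tokenization, lowercasing, and no stemming, so its scores will not match an official ROUGE implementation exactly. The function name `rouge1_f1` and the two sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between two texts."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, not taken from the dataset.
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

ROUGE measures lexical overlap with a reference summary only; it does not check whether a summary is faithful to the source document, which is what the annotations in this dataset record.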

Languages

The text in the dataset is in English and consists of abstractive summaries for the XSum dataset.

Dataset structure

Data instances

Faithfulness annotation dataset

A typical data point consists of the ID of the news article (the full document), a summary, and hallucination span information.

Example (JSON):

{
  "bbcid": 34687720,
  "hallucinated_span_end": 114,
  "hallucinated_span_start": 1,
  "hallucination_type": 1,
  "summary": "rory mcilroy will take a one-shot lead into the final round of the wgc-hsbc champions after carding a three-under",
  "system": "BERTS2S",
  "worker_id": "wid_0"
}

Factuality annotation dataset

A typical data point consists of the ID of the news article (the full document), a summary, and whether the summary is factual.

Example (JSON):

{
  "bbcid": 29911712,
  "is_factual": 0,
  "summary": "more than 50 pupils at a bristol academy have been sent home from school because of a lack of uniform.",
  "system": "BERTS2S",
  "worker_id": "wid_0"
}

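Data fields

Faithfulness annotation dataset

bbcid: document ID in the XSum corpus.
system: name of the neural summarizer.
summary: summary generated by 'system'.
hallucination_type: type of hallucination: intrinsic (0) or extrinsic (1).
hallucinated_span_start: start index of the hallucinated span.
hallucinated_span_end: end index of the hallucinated span.
worker_id: worker ID (one of wid_0, wid_1, wid_2).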
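The span fields above can be turned back into text with a short helper. The sketch below assumes a 1-based start index and an end index that points one past the last character; this is inferred from the example record (a 113-character summary with a span of 1 to 114) and is not stated by the source, so it should be verified against more records. The helper name `describe_span` and the handling of non-positive indices are likewise assumptions.

```python
# Label mapping taken from the field description: intrinsic (0), extrinsic (1).
HALLUCINATION_TYPES = {0: "intrinsic", 1: "extrinsic"}


def describe_span(record: dict) -> dict:
    """Return the hallucination type name and the text of the marked span."""
    start = record["hallucinated_span_start"]
    end = record["hallucinated_span_end"]
    if start < 1 or end <= start:
        # ASSUMPTION: non-positive or empty ranges mean no span was marked.
        text = ""
    else:
        # ASSUMPTION: 1-based start, end one past the last character.
        text = record["summary"][start - 1:end - 1]
    return {
        "worker_id": record["worker_id"],
        "type": HALLUCINATION_TYPES.get(record["hallucination_type"], "unknown"),
        "text": text,
    }


# The faithfulness example record shown in this document.
record = {
    "bbcid": 34687720,
    "hallucinated_span_end": 114,
    "hallucinated_span_start": 1,
    "hallucination_type": 1,
    "summary": "rory mcilroy will take a one-shot lead into the final round "
               "of the wgc-hsbc champions after carding a three-under",
    "system": "BERTS2S",
    "worker_id": "wid_0",
}

span = describe_span(record)
print(span["type"], repr(span["text"]))
```

Under this reading, the annotator in the example marked the entire summary as an extrinsic hallucination.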

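Factuality annotation dataset

bbcid: document ID in the XSum corpus.
system: name of the neural summarizer.
summary: summary generated by 'system'.
is_factual: whether the summary is factual: yes (1) or no (0).
worker_id: worker ID (one of wid_0, wid_1, wid_2).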
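Because each document-system pair carries several worker judgments, a common first step is to collapse them into one label per pair. The sketch below uses a simple majority vote over `is_factual`; this aggregation rule is an illustrative choice, not a procedure described by the source. The function name `majority_is_factual` and all rows are hypothetical toy values, not real annotations.

```python
from collections import defaultdict


def majority_is_factual(rows):
    """Map each (bbcid, system) pair to 1 if most workers judged it factual."""
    votes = defaultdict(list)
    for row in rows:
        votes[(row["bbcid"], row["system"])].append(row["is_factual"])
    # A strict majority is required; ties resolve to 0 (not factual).
    return {pair: int(2 * sum(v) > len(v)) for pair, v in votes.items()}


# Hypothetical toy rows using the documented field names.
rows = [
    {"bbcid": 1, "system": "toy_system", "is_factual": 0, "worker_id": "wid_0"},
    {"bbcid": 1, "system": "toy_system", "is_factual": 0, "worker_id": "wid_1"},
    {"bbcid": 1, "system": "toy_system", "is_factual": 1, "worker_id": "wid_2"},
    {"bbcid": 2, "system": "toy_system", "is_factual": 1, "worker_id": "wid_0"},
    {"bbcid": 2, "system": "toy_system", "is_factual": 1, "worker_id": "wid_1"},
    {"bbcid": 2, "system": "toy_system", "is_factual": 0, "worker_id": "wid_2"},
]

labels = majority_is_factual(rows)
print(labels)
```

Grouping by the `(bbcid, system)` pair rather than by `bbcid` alone matters because the same document is summarized by more than one system.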

Data splits

The dataset has a single split, train.

Subset	train
Faithfulness annotation dataset	11185
Factuality annotation dataset	5597
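The row counts in the table can be checked after loading each subset with the Hugging Face `datasets` library. The sketch below is hedged in two ways: the configuration names `xsum_faithfulness` and `xsum_factuality` are assumptions that should be confirmed on the dataset page, and the helper names `load_train_split` and `row_count_mismatches` are hypothetical. Loading requires network access, so the example call is left as a comment.

```python
REPO_ID = "google-research-datasets/xsum_factuality"

# Train-split row counts as listed in the table above.
# ASSUMPTION: the keys are the configuration names used on the Hugging Face Hub.
EXPECTED_TRAIN_ROWS = {"xsum_faithfulness": 11185, "xsum_factuality": 5597}


def load_train_split(config_name: str):
    """Load the train split of one subset (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the helpers work offline

    return load_dataset(REPO_ID, config_name, split="train")


def row_count_mismatches(actual_counts: dict) -> dict:
    """Return {config: (expected, actual)} for every count that differs."""
    return {
        name: (expected, actual_counts.get(name))
        for name, expected in EXPECTED_TRAIN_ROWS.items()
        if actual_counts.get(name) != expected
    }


# Example usage (performs a download):
# counts = {name: len(load_train_split(name)) for name in EXPECTED_TRAIN_ROWS}
# print(row_count_mismatches(counts) or "row counts match the table")
```

An empty result from `row_count_mismatches` means the downloaded subsets agree with the listed sizes; any entry points to a changed or mislabeled configuration.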

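Dataset creation

Source data

The dataset is an extension of the XSum dataset.

Annotation process

The dataset was annotated through crowdsourcing.

License information

The dataset is released under the Creative Commons Attribution 4.0 International license.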

Citation information

@InProceedings{maynez_acl20,
  author = "Joshua Maynez and Shashi Narayan and Bernd Bohnet and Ryan Thomas Mcdonald",
  title = "On Faithfulness and Factuality in Abstractive Summarization",
  booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
  year = "2020",
  pages = "1906--1919",
  address = "Online",
}
