kundank/usb

Name: kundank/usb
Creator: kundank
Published: 2023-12-09 03:47:33
License: Apache 2.0

Source: Hugging Face. Updated 2023-12-09. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/kundank/usb

Resource description:

---
license: apache-2.0
task_categories:
- summarization
language:
- en
tags:
- factchecking
- summarization
- nli
size_categories:
- 1K<n<10K
---

# USB: A Unified Summarization Benchmark Across Tasks and Domains

This benchmark contains labeled datasets for 8 text summarization based tasks given below. The labeled datasets are created by collecting manual annotations on top of Wikipedia articles from 6 different domains.

|Task |Description |Code snippet |
|----------------|-------------------------------|-----------------------------|
| Extractive Summarization | Highlight important sentences in the source article | `load_dataset("kundank/usb","extractive_summarization")` |
| Abstractive Summarization | Generate a summary of the source | `load_dataset("kundank/usb","abstractive_summarization")` |
| Topic-based Summarization | Generate a summary of the source focusing on the given topic | `load_dataset("kundank/usb","topicbased_summarization")` |
| Multi-sentence Compression | Compress selected sentences into a one-line summary | `load_dataset("kundank/usb","multisentence_compression")` |
| Evidence Extraction | Surface evidence from the source for a summary sentence | `load_dataset("kundank/usb","evidence_extraction")` |
| Factuality Classification | Predict the factual accuracy of a summary sentence with respect to provided evidence | `load_dataset("kundank/usb","factuality_classification")` |
| Unsupported Span Prediction | Identify spans in a summary sentence which are not substantiated by the provided evidence | `load_dataset("kundank/usb","unsupported_span_prediction")` |
| Fixing Factuality | Rewrite a summary sentence to remove any factual errors or unsupported claims, with respect to provided evidence | `load_dataset("kundank/usb","fixing_factuality")` |

Additionally, to load the full set of collected annotations which were leveraged to make the labeled datasets for above tasks, use the command: ``load_dataset("kundank/usb","all_annotations")``

## Trained models

We fine-tuned Flan-T5-XL models on the training set of each task in the benchmark. They are available at the links given below:

|Task |Finetuned Flan-T5-XL model |
|----------------|-----------------------------|
| Extractive Summarization | [link](https://huggingface.co/kundank/usb-extractive_summarization-flant5xl) |
| Abstractive Summarization | [link](https://huggingface.co/kundank/usb-abstractive_summarization-flant5xl) |
| Topic-based Summarization | [link](https://huggingface.co/kundank/usb-topicbased_summarization-flant5xl) |
| Multi-sentence Compression | [link](https://huggingface.co/kundank/usb-multisentence_compression-flant5xl) |
| Evidence Extraction | [link](https://huggingface.co/kundank/usb-evidence_extraction-flant5xl) |
| Factuality Classification | [link](https://huggingface.co/kundank/usb-factuality_classification-flant5xl) |
| Unsupported Span Prediction | [link](https://huggingface.co/kundank/usb-unsupported_span_prediction-flant5xl) |
| Fixing Factuality | [link](https://huggingface.co/kundank/usb-fixing_factuality-flant5xl) |

More details can be found in the paper: https://aclanthology.org/2023.findings-emnlp.592/

If you use this dataset, please cite it as below:

```
@inproceedings{krishna-etal-2023-usb,
    title = "{USB}: A Unified Summarization Benchmark Across Tasks and Domains",
    author = "Krishna, Kundan and Gupta, Prakhar and Ramprasad, Sanjana and Wallace, Byron and Bigham, Jeffrey and Lipton, Zachary",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    year = "2023",
    pages = "8826--8845"
}
```
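The checkpoint links in the trained-models table all follow one naming pattern, `kundank/usb-<task>-flant5xl`. The sketch below builds that repository id from a task name and, optionally, loads the checkpoint. The helper names `usb_checkpoint_id` and `load_usb_model` are hypothetical and not part of the dataset card; loading assumes the `transformers` library is installed and that the checkpoints load as standard sequence-to-sequence models, which is how Flan-T5 is normally used. The prompt format each model expects is not described here, so it is left out.

```python
# Task names exactly as they appear in the checkpoint links.
USB_TASKS = [
    "extractive_summarization",
    "abstractive_summarization",
    "topicbased_summarization",
    "multisentence_compression",
    "evidence_extraction",
    "factuality_classification",
    "unsupported_span_prediction",
    "fixing_factuality",
]


def usb_checkpoint_id(task: str) -> str:
    """Return the Hub repository id of the fine-tuned model for a task."""
    if task not in USB_TASKS:
        raise ValueError(f"unknown USB task: {task!r}")
    return f"kundank/usb-{task}-flant5xl"


def load_usb_model(task: str):
    """Hypothetical helper: download the tokenizer and model for a task.

    Assumes `transformers` is installed and the Hub is reachable.
    The import is deferred so the id helper works without the library.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    repo_id = usb_checkpoint_id(task)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    return tokenizer, model


if __name__ == "__main__":
    # Only builds the id; no download happens here.
    print(usb_checkpoint_id("fixing_factuality"))
```

Calling `load_usb_model("fixing_factuality")` would download a Flan-T5-XL sized checkpoint, so it is worth checking available disk space and memory first.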

Provider:

kundank

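Summary of original information

USB: A Unified Summarization Benchmark

Dataset overview

License: Apache 2.0
Task category: summarization
Language: English
Tags: fact-checking, summarization, natural language inference (NLI)
Dataset size: 1K<n<10K

Dataset details

The benchmark contains labeled datasets for 8 text-summarization tasks. The datasets were created by collecting manual annotations on Wikipedia articles from 6 different domains.

Task list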

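|Task |Description |Code snippet |
|----------------|-------------------------------|-----------------------------|
| Extractive Summarization | Highlight important sentences in the source article | `load_dataset("kundank/usb","extractive_summarization")` |
| Abstractive Summarization | Generate a summary of the source article | `load_dataset("kundank/usb","abstractive_summarization")` |
| Topic-based Summarization | Generate a summary of the source article focused on a given topic | `load_dataset("kundank/usb","topicbased_summarization")` |
| Multi-sentence Compression | Compress selected sentences into a one-line summary | `load_dataset("kundank/usb","multisentence_compression")` |
| Evidence Extraction | Extract evidence from the source article that supports a summary sentence | `load_dataset("kundank/usb","evidence_extraction")` |
| Factuality Classification | Predict the factual accuracy of a summary sentence with respect to the provided evidence | `load_dataset("kundank/usb","factuality_classification")` |
| Unsupported Span Prediction | Identify spans in a summary sentence that are not supported by the provided evidence | `load_dataset("kundank/usb","unsupported_span_prediction")` |
| Fixing Factuality | Rewrite a summary sentence to remove factual errors or unsupported claims, with respect to the provided evidence | `load_dataset("kundank/usb","fixing_factuality")` |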

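Loading the full annotation set

To load the full set of collected annotations that were used to create the labeled datasets for the tasks above, use the command: `load_dataset("kundank/usb","all_annotations")`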
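As a minimal sketch of how these commands might be wrapped, the snippet below collects the eight task configurations plus `all_annotations` and validates the name before calling `load_dataset`. The wrapper `load_usb` and the tuple `USB_CONFIGS` are hypothetical names introduced for illustration. The sketch assumes the Hugging Face `datasets` library is installed and the repository is reachable; the names of the returned splits are not listed on this page, so the code prints whatever the library returns instead of assuming them.

```python
# Configuration names taken from the load_dataset commands on this page.
USB_CONFIGS = (
    "extractive_summarization",
    "abstractive_summarization",
    "topicbased_summarization",
    "multisentence_compression",
    "evidence_extraction",
    "factuality_classification",
    "unsupported_span_prediction",
    "fixing_factuality",
    "all_annotations",
)


def load_usb(config: str = "all_annotations"):
    """Hypothetical wrapper: load one USB configuration from the Hub.

    Rejects unknown names early so that a typo fails before any download.
    """
    if config not in USB_CONFIGS:
        raise ValueError(
            f"unknown configuration {config!r}; expected one of {USB_CONFIGS}"
        )
    # Deferred import: the name check above works without `datasets`.
    from datasets import load_dataset

    return load_dataset("kundank/usb", config)


if __name__ == "__main__":
    import sys

    # Network access happens only when a configuration name is passed, e.g.
    #   python usb_example.py factuality_classification
    if len(sys.argv) > 1:
        print(load_usb(sys.argv[1]))  # shows the available splits and columns
    else:
        print("available configurations:", ", ".join(USB_CONFIGS))
```

Printing the returned object is a quick way to see which splits and columns a configuration provides before writing task-specific code.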
