
kundank/usb

Source: Hugging Face. Updated 2023-12-09; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/kundank/usb
Resource description:
---
license: apache-2.0
task_categories:
- summarization
language:
- en
tags:
- factchecking
- summarization
- nli
size_categories:
- 1K<n<10K
---

# USB: A Unified Summarization Benchmark Across Tasks and Domains

This benchmark contains labeled datasets for 8 text summarization based tasks given below. The labeled datasets are created by collecting manual annotations on top of Wikipedia articles from 6 different domains.

|Task |Description |Code snippet |
|----------------|-------------------------------|-----------------------------|
| Extractive Summarization | Highlight important sentences in the source article | `load_dataset("kundank/usb","extractive_summarization")` |
| Abstractive Summarization | Generate a summary of the source | `load_dataset("kundank/usb","abstractive_summarization")` |
| Topic-based Summarization | Generate a summary of the source focusing on the given topic | `load_dataset("kundank/usb","topicbased_summarization")` |
| Multi-sentence Compression | Compress selected sentences into a one-line summary | `load_dataset("kundank/usb","multisentence_compression")` |
| Evidence Extraction | Surface evidence from the source for a summary sentence | `load_dataset("kundank/usb","evidence_extraction")` |
| Factuality Classification | Predict the factual accuracy of a summary sentence with respect to provided evidence | `load_dataset("kundank/usb","factuality_classification")` |
| Unsupported Span Prediction | Identify spans in a summary sentence which are not substantiated by the provided evidence | `load_dataset("kundank/usb","unsupported_span_prediction")` |
| Fixing Factuality | Rewrite a summary sentence to remove any factual errors or unsupported claims, with respect to provided evidence | `load_dataset("kundank/usb","fixing_factuality")` |

Additionally, to load the full set of collected annotations which were leveraged to make the labeled datasets for above tasks, use the command: ``load_dataset("kundank/usb","all_annotations")``

## Trained models

We fine-tuned Flan-T5-XL models on the training set of each task in the benchmark. They are available at the links given below:

|Task |Finetuned Flan-T5-XL model |
|----------------|-----------------------------|
| Extractive Summarization | [link](https://huggingface.co/kundank/usb-extractive_summarization-flant5xl) |
| Abstractive Summarization | [link](https://huggingface.co/kundank/usb-abstractive_summarization-flant5xl) |
| Topic-based Summarization | [link](https://huggingface.co/kundank/usb-topicbased_summarization-flant5xl) |
| Multi-sentence Compression | [link](https://huggingface.co/kundank/usb-multisentence_compression-flant5xl) |
| Evidence Extraction | [link](https://huggingface.co/kundank/usb-evidence_extraction-flant5xl) |
| Factuality Classification | [link](https://huggingface.co/kundank/usb-factuality_classification-flant5xl) |
| Unsupported Span Prediction | [link](https://huggingface.co/kundank/usb-unsupported_span_prediction-flant5xl) |
| Fixing Factuality | [link](https://huggingface.co/kundank/usb-fixing_factuality-flant5xl) |

More details can be found in the paper: https://aclanthology.org/2023.findings-emnlp.592/

If you use this dataset, please cite it as below:

```
@inproceedings{krishna-etal-2023-usb,
    title = "{USB}: A Unified Summarization Benchmark Across Tasks and Domains",
    author = "Krishna, Kundan and Gupta, Prakhar and Ramprasad, Sanjana and Wallace, Byron and Bigham, Jeffrey and Lipton, Zachary",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    year = "2023",
    pages = "8826--8845"
}
```
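
The fine-tuned checkpoints linked in the trained-models table share one naming pattern, `kundank/usb-<task>-flant5xl`. The sketch below builds a repository ID from a task name and loads it with the `transformers` library. The helpers `usb_model_id` and `load_usb_model` are hypothetical names introduced here for illustration. The sketch assumes the checkpoints load through `AutoModelForSeq2SeqLM`, which is the usual route for Flan-T5 models but is not confirmed by the dataset card. The card also does not describe the input format each model expects, so that part is left out.

```python
# Task names taken from the model links in the table above.
USB_TASKS = (
    "extractive_summarization",
    "abstractive_summarization",
    "topicbased_summarization",
    "multisentence_compression",
    "evidence_extraction",
    "factuality_classification",
    "unsupported_span_prediction",
    "fixing_factuality",
)


def usb_model_id(task):
    """Return the Hub repository ID for a task (hypothetical helper)."""
    if task not in USB_TASKS:
        raise ValueError(f"Unknown USB task {task!r}; expected one of {USB_TASKS}")
    return f"kundank/usb-{task}-flant5xl"


def load_usb_model(task):
    """Download the tokenizer and model for a task (hypothetical helper).

    Assumption: the checkpoint is a standard seq2seq model that
    AutoModelForSeq2SeqLM can load. An XL-sized model needs several
    gigabytes of memory.
    """
    # Imported lazily so usb_model_id works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_id = usb_model_id(task)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model
```

Keeping the ID construction separate from the download makes it possible to check a task name before fetching any weights.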
Provider:
kundank
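Summary of original information

USB: A Unified Summarization Benchmark

Dataset overview

  • License: Apache 2.0
  • Task category: summarization
  • Language: English
  • Tags: fact checking, summarization, natural language inference
  • Dataset size: 1K<n<10K

Dataset details

The benchmark contains labeled datasets for 8 text-summarization-based tasks. They were created by collecting manual annotations on top of Wikipedia articles from 6 different domains.

Task list

|Task |Description |Code snippet |
|----------------|-------------------------------|-----------------------------|
| Extractive summarization | Highlight important sentences in the source article | `load_dataset("kundank/usb","extractive_summarization")` |
| Abstractive summarization | Generate a summary of the source article | `load_dataset("kundank/usb","abstractive_summarization")` |
| Topic-based summarization | Generate a summary of the source article focusing on the given topic | `load_dataset("kundank/usb","topicbased_summarization")` |
| Multi-sentence compression | Compress selected sentences into a one-line summary | `load_dataset("kundank/usb","multisentence_compression")` |
| Evidence extraction | Extract evidence from the source article that supports a summary sentence | `load_dataset("kundank/usb","evidence_extraction")` |
| Factuality classification | Predict the factual accuracy of a summary sentence with respect to the provided evidence | `load_dataset("kundank/usb","factuality_classification")` |
| Unsupported span prediction | Identify spans in a summary sentence that are not supported by the provided evidence | `load_dataset("kundank/usb","unsupported_span_prediction")` |
| Fixing factuality | Rewrite a summary sentence to remove any factual errors or unsupported claims, with respect to the provided evidence | `load_dataset("kundank/usb","fixing_factuality")` |

Loading the full annotation set

To load the full set of collected annotations used to create the labeled datasets for the tasks above, use the following command: `load_dataset("kundank/usb","all_annotations")`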
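
The `load_dataset` commands above differ only in the configuration name, so they can be wrapped in one small function. This minimal sketch assumes the Hugging Face `datasets` library is installed. `load_usb` is a hypothetical helper name, not part of the dataset's own code. Some versions of `datasets` may need extra arguments to `load_dataset`, which this page does not cover.

```python
# Configuration names copied from the task list above, plus the
# "all_annotations" configuration that holds the full annotation set.
USB_CONFIGS = (
    "extractive_summarization",
    "abstractive_summarization",
    "topicbased_summarization",
    "multisentence_compression",
    "evidence_extraction",
    "factuality_classification",
    "unsupported_span_prediction",
    "fixing_factuality",
    "all_annotations",
)


def load_usb(config_name):
    """Load one configuration of kundank/usb (hypothetical helper)."""
    if config_name not in USB_CONFIGS:
        raise ValueError(
            f"Unknown USB config {config_name!r}; expected one of {USB_CONFIGS}"
        )
    # Imported lazily so the name check works without the library installed.
    from datasets import load_dataset

    return load_dataset("kundank/usb", config_name)
```

For example, `load_usb("factuality_classification")` downloads that task's data. The split and column names are not listed on this page, so print the returned object to inspect them before relying on any field.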
