gigant/robust_long_abstractive_human_annotation
Hugging Face · Updated 2024-04-25 · Indexed 2024-06-12
Download link:
https://hf-mirror.com/datasets/gigant/robust_long_abstractive_human_annotation
Description:
---
dataset_info:
  features:
  - name: model_type
    dtype: string
  - name: dataset
    dtype: string
  - name: factual_consistency
    dtype: float64
  - name: relevance
    dtype: float64
  - name: model_summary
    dtype: string
  - name: dataset_id
    dtype: string
  splits:
  - name: test
    num_bytes: 766812
    num_examples: 408
  download_size: 234773
  dataset_size: 766812
configs:
- config_name: default
  data_files:
  - split: test
    path: data/test-*
---
[Original repository](https://github.com/huankoh/How-Far-are-We-from-Robust-Long-Abstractive-Summarization/tree/main)
# How Far are We from Robust Long Abstractive Summarization? (EMNLP 2022)
[[`Paper`]](https://arxiv.org/abs/2210.16732)
#### Huan Yee Koh<sup>\*</sup>, Jiaxin Ju<sup>\*</sup>, He Zhang, Ming Liu, Shirui Pan ####
(\* denotes equal contribution)
## Human Annotation of Model-Generated Summaries
| **Data Field** | **Definition** |
| :--------: |:---- |
| **dataset** | Source dataset of the summarized document: arXiv or GovReport. |
| **dataset_id** | `ID_` followed by the document ID from the source dataset. To match the IDs with the original datasets, remove the `ID_` prefix. The IDs come from the original [arXiv](https://github.com/armancohan/long-summarization) and [GovReport](https://gov-report-data.github.io/) datasets. |
| **model_type** | Model variant that generated the summary. 1K, 4K, and 8K denote input token limits of 1,024, 4,096, and 8,192 tokens, respectively. For more information, please refer to the original paper. |
| **model_summary** | Model-generated summary. |
| **relevance** | Percentage of the reference summary’s main ideas contained in the generated summary. Higher = Better. |
| **factual_consistency** | Percentage of factually consistent sentences in the generated summary. Higher = Better. |
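
The `dataset_id` and `model_type` conventions in the table can be sketched in Python as follows. The helper names `original_document_id` and `input_token_limit` are hypothetical. The sketch also assumes that each `model_type` string contains the literal marker `1K`, `4K`, or `8K`. The exact strings used in the data are not listed here, so inspect them before relying on the parser.

```python
import re
from typing import Optional

# Token limits for the 1K / 4K / 8K markers, as stated in the field table.
TOKEN_LIMITS = {"1K": 1024, "4K": 4096, "8K": 8192}

# Assumption: the marker is not glued to other letters or digits (so "11K" is ignored).
_MARKER = re.compile(r"(?<![0-9A-Za-z])[148]K(?![0-9A-Za-z])")


def original_document_id(dataset_id: str) -> str:
    """Drop a leading "ID_" so the ID matches the original arXiv/GovReport data."""
    prefix = "ID_"
    return dataset_id[len(prefix):] if dataset_id.startswith(prefix) else dataset_id


def input_token_limit(model_type: str) -> Optional[int]:
    """Return the token limit implied by a 1K/4K/8K marker, or None if none is found."""
    match = _MARKER.search(model_type)
    return TOKEN_LIMITS[match.group(0)] if match else None
```

Returning `None` for an unrecognized `model_type` keeps the parser from guessing a limit when the naming assumption does not hold.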
## Citation
For more information, please refer to: [<i>How Far are We from Robust Long Abstractive Summarization?</i>](https://arxiv.org/abs/2210.16732)
```
@inproceedings{koh-etal-2022-far,
title = "How Far are We from Robust Long Abstractive Summarization?",
author = "Koh, Huan Yee and Ju, Jiaxin and Zhang, He and Liu, Ming and Pan, Shirui",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-main.172",
pages = "2682--2698"
}
```
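Provider:
gigant
Original information summary
Dataset overview
Dataset features
- model_type: string; the model variant that generated the summary. 1K, 4K, and 8K denote the model's input token limit.
- dataset: string; indicates whether the model-generated summary comes from the arXiv or GovReport dataset.
- factual_consistency: float; the percentage of factually consistent sentences in the generated summary.
- relevance: float; the percentage of the reference summary's main ideas contained in the generated summary.
- model_summary: string; the model-generated summary.
- dataset_id: string; the dataset ID in the form ID_ + document ID. Remove "ID_" to match the original dataset IDs.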
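
One way to use the two score fields is to average them per `model_type`. The sketch below is illustrative only: the function name `mean_scores_by_model` is hypothetical, the rows are made up rather than taken from the dataset, and nothing is assumed about whether the percentages are stored on a 0 to 1 or a 0 to 100 scale.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping

SCORE_FIELDS = ("relevance", "factual_consistency")


def mean_scores_by_model(rows: Iterable[Mapping]) -> Dict[str, Dict[str, float]]:
    """Average relevance and factual_consistency over the rows of each model_type."""
    grouped = defaultdict(lambda: {field: [] for field in SCORE_FIELDS})
    for row in rows:
        for field in SCORE_FIELDS:
            grouped[row["model_type"]][field].append(row[field])
    return {
        model: {field: mean(values) for field, values in scores.items()}
        for model, scores in grouped.items()
    }


# Made-up rows for illustration only; these are not values from the dataset.
example_rows = [
    {"model_type": "model-a", "relevance": 0.5, "factual_consistency": 1.0},
    {"model_type": "model-a", "relevance": 1.0, "factual_consistency": 0.5},
    {"model_type": "model-b", "relevance": 0.25, "factual_consistency": 0.75},
]
summary = mean_scores_by_model(example_rows)
```

The function accepts any iterable of dict-like rows, so it does not depend on a particular loading library.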
Dataset splits
- test: 408 examples, 766812 bytes in total.
Dataset size
- Download size: 234773 bytes
- Dataset size: 766812 bytes
Configuration
- config_name: default
- data_files:
  - split: test
  - path: data/test-*
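
Given the single `default` configuration and `test` split described above, loading might look like the sketch below. It assumes the Hugging Face `datasets` library is installed and that the repository is reachable under the ID shown in the page title. The names `load_annotations`, `missing_columns`, and `EXPECTED_COLUMNS` are hypothetical helpers, not part of the dataset or the library.

```python
# Column names and dtypes as listed in the dataset card metadata.
EXPECTED_COLUMNS = {
    "model_type": "string",
    "dataset": "string",
    "factual_consistency": "float64",
    "relevance": "float64",
    "model_summary": "string",
    "dataset_id": "string",
}


def load_annotations():
    """Load the `test` split from the Hub (needs network access)."""
    # Imported lazily so the rest of this sketch works without the library.
    from datasets import load_dataset

    return load_dataset(
        "gigant/robust_long_abstractive_human_annotation", split="test"
    )


def missing_columns(column_names):
    """Return the expected columns that are absent from `column_names`, sorted."""
    return sorted(set(EXPECTED_COLUMNS) - set(column_names))
```

After `annotations = load_annotations()`, comparing `annotations.num_rows` with the 408 examples listed for the split and calling `missing_columns(annotations.column_names)` gives a quick check that the download matches the card.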



