launch/gov_report_qs
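Dataset Overview
Dataset Name
- GovReport-QS
Dataset Summary
- Built on the GovReport dataset, GovReport-QS additionally provides annotated question-summary hierarchies for government reports. These hierarchies make the document structure explicit, which helps readers engage with and understand the content.
Languages
- English
License
- CC BY 4.0
Multilinguality
- Monolingual
Size Category
- 10K<n<100K
Source Datasets
- launch/gov_report
Task Categories
- Summarization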
Dataset Structure
Data Instance Configurations
- paragraph (default): paragraph-level annotated data
- document: paragraph-level annotated data aggregated by document
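The two configurations can be selected by name when loading the dataset. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and the dataset is reachable on the Hub under `launch/gov_report_qs`. The helper `load_gov_report_qs` is a hypothetical name used for illustration, and some `datasets` versions may need extra arguments for script-based datasets.

```python
# Configuration names taken from the list above; "paragraph" is the default.
CONFIGS = ("paragraph", "document")


def load_gov_report_qs(config="paragraph"):
    """Hypothetical helper: load one configuration of launch/gov_report_qs."""
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration {config!r}; expected one of {CONFIGS}")
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    # Assumption: the dataset is available on the Hugging Face Hub under this ID.
    return load_dataset("launch/gov_report_qs", config)
```

If loading succeeds, `load_gov_report_qs("document")["train"]` would give the training split of the document-level configuration.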
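Data Fields
paragraph
- doc_id: string
- summary_paragraph_index: integer
- document_sections: dictionary containing titles, paragraphs, and depths
- question_summary_pairs: dictionary containing questions, summaries, and parent pair indices
document
- id: string
- document_sections: dictionary containing titles, paragraphs, depths, and alignment information
- question_summary_pairs: dictionary containing questions, summaries, parent pair indices, and summary paragraph indices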
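The parent pair indices are what turn the flat list of question-summary pairs into a hierarchy. The sketch below shows one way to rebuild it. It assumes the pairs arrive as parallel lists under the keys `question`, `summary`, and `parent_pair_index`, and that top-level pairs use `-1` as their parent index. Both the key names and the root marker are assumptions, and the example record is made up rather than taken from the dataset.

```python
from collections import defaultdict


def build_hierarchy(pairs, root_marker=-1):
    """Render question-summary pairs as indented lines, children under parents.

    Assumes `pairs` is a dict of parallel lists keyed by "question",
    "summary", and "parent_pair_index" (assumed names), with top-level
    pairs carrying `root_marker` as their parent index.
    """
    # Map each parent index to the indices of its direct children.
    children = defaultdict(list)
    for idx, parent in enumerate(pairs["parent_pair_index"]):
        children[parent].append(idx)

    def render(idx, depth):
        indent = "  " * depth
        lines = [
            f"{indent}Q: {pairs['question'][idx]}",
            f"{indent}S: {pairs['summary'][idx]}",
        ]
        for child in children[idx]:
            lines.extend(render(child, depth + 1))
        return lines

    out = []
    for root in children[root_marker]:
        out.extend(render(root, 0))
    return out


# Made-up example record for illustration only.
example = {
    "question": [
        "What does the report cover?",
        "Which agencies are involved?",
        "What are the findings?",
    ],
    "summary": [
        "It reviews program X.",
        "Agency A and Agency B.",
        "Costs increased.",
    ],
    "parent_pair_index": [-1, 0, -1],
}

print("\n".join(build_hierarchy(example)))
```

Here the second pair is indented under the first because its parent index is 0, while the third pair starts a new top-level branch.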
Data Splits
paragraph
- Train: 17519
- Validation: 974
- Test: 973
document
- Train: 1371
- Validation: 171
- Test: 172
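As a quick sanity check on these counts, the short sketch below totals each configuration and computes the share of each split. The numbers are copied from the lists above; the function name `split_shares` is only illustrative.

```python
# Split sizes copied from the lists above.
SPLITS = {
    "paragraph": {"train": 17519, "validation": 974, "test": 973},
    "document": {"train": 1371, "validation": 171, "test": 172},
}


def split_shares(sizes):
    """Return (total, {split: percentage of total, rounded to one decimal})."""
    total = sum(sizes.values())
    return total, {name: round(100 * n / total, 1) for name, n in sizes.items()}


for config, sizes in SPLITS.items():
    total, shares = split_shares(sizes)
    print(config, total, shares)
```

The paragraph configuration totals 19466 instances in a roughly 90/5/5 split, and the document configuration totals 1714 instances in a roughly 80/10/10 split.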
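Dataset Creation
Source Language Producers
- Editors of the Congressional Research Service and the U.S. Government Accountability Office
Licensing Information
- CC BY 4.0
Citation Information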
@inproceedings{cao-wang-2022-hibrids,
    title = "{HIBRIDS}: Attention with Hierarchical Biases for Structure-aware Long Document Summarization",
    author = "Cao, Shuyang and Wang, Lu",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.58",
    pages = "786--807",
    abstract = "Document structure is critical for efficient information consumption. However, it is challenging to encode it efficiently into the modern Transformer architecture. In this work, we present HIBRIDS, which injects Hierarchical Biases foR Incorporating Document Structure into attention score calculation. We further present a new task, hierarchical question-summary generation, for summarizing salient content in the source document into a hierarchy of questions and summaries, where each follow-up question inquires about the content of its parent question-summary pair. We also annotate a new dataset with 6,153 question-summary hierarchies labeled on government reports. Experiment results show that our model produces better question-summary hierarchies than comparisons on both hierarchy quality and content coverage, a finding also echoed by human judges. Additionally, our model improves the generation of long-form summaries from long government reports and Wikipedia articles, as measured by ROUGE scores.",
}




