Awesome-Summarization-Datasets
Dataset Overview
This collection of datasets is compiled from the findings of the survey "The State and Fate of Summarization Datasets".
Citation
If this survey contributed to your research, please cite the following paper in your work:
@misc{dahan2024statefatesummarizationdatasets,
  title={The State and Fate of Summarization Datasets},
  author={Noam Dahan and Gabriel Stanovsky},
  year={2024},
  eprint={2411.04585},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.04585},
}
Data Card Template
To standardize dataset descriptions, the following LaTeX data card template is recommended:
\begin{table}[tb!]
\resizebox{\columnwidth}{!}{%
\begin{tabular}{|p{7.5cm}|}
\hline
\textbf{Summarization Data Card} \\ \hline
\textbf{\underline{Sample information:}} \\
\textbf{Languages:}
\newline
\textit{List all supported languages} \\
\textbf{Summary Shape:}
\newline
\textit{Paragraph/One Sentence/Highlights/Span} \\
\textbf{Domain:}
\newline
\textit{Example: News/Scientific/Dialogues/etc.} \\
\textbf{Size:}
\newline
\textit{Number of document-summary pairs} \\ \hline
\textbf{\underline{Annotation information:}} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Annotation efforts:} \\ \textit{Automatic, Human annotations, Semi-automatic}\end{tabular} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Source of supervision:} \\ \textit{Natural} (summaries created organically)/ \\ \textit{Distant} (annotations are proxies of summaries)/ \\ \textit{Dedicated} (annotations created by researchers)\end{tabular} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Brief description of the summaries source:} \\ \textit{Example: digests of legal documents}\end{tabular} \\ \hline
\textbf{\underline{Data quality assessment:}} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Abstraction level:} \\ \textit{1-to-4-gram ratios}\end{tabular} \\
\textbf{Compression rate:}
$\frac{\text{doc length (\#words)}}{\text{summary length (\#words)}}$ \\
\textbf{Human evaluation:} \textit{Yes/No} \\ \hline
\textbf{\underline{Availability details:}} \\
\begin{tabular}[c]{@{}l@{}}\textbf{How is the data made accessible:} \\ \textit{Publicly Available} / \textit{URL-based Reconstruction} / \\ \textit{Upon Request}\end{tabular} \\
\begin{tabular}[c]{@{}l@{}}\textbf{Copyrights information:} \\ \textit{License}\end{tabular} \\ \hline
\end{tabular}%
}
\caption{Template for a summarization data card.}
\label{tab:datacard}
\end{table}
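The two quantitative fields of the data card, compression rate and abstraction level, can be estimated with a few lines of Python. The sketch below is illustrative only and rests on assumptions the template does not state: words are obtained by lowercased whitespace splitting, and "1-to-4-gram ratios" is read as the fraction of summary n-grams (n = 1..4) that do not occur in the source document, a common convention for measuring abstractiveness. The function names `tokenize`, `ngrams`, `compression_rate`, and `novel_ngram_ratios` are hypothetical and not part of the survey.

```python
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercased whitespace tokenization; the template fixes no tokenizer.
    return text.lower().split()


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    # All contiguous n-grams, in order; empty if the text has fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def compression_rate(document: str, summary: str) -> float:
    # doc length (#words) / summary length (#words), as in the data card.
    summary_len = len(tokenize(summary))
    if summary_len == 0:
        raise ValueError("summary must contain at least one word")
    return len(tokenize(document)) / summary_len


def novel_ngram_ratios(document: str, summary: str, max_n: int = 4) -> Dict[int, float]:
    # Assumed reading of "1-to-4-gram ratios": for each n, the share of
    # summary n-grams that never appear in the document.
    doc_tokens = tokenize(document)
    sum_tokens = tokenize(summary)
    ratios: Dict[int, float] = {}
    for n in range(1, max_n + 1):
        summary_ngrams = ngrams(sum_tokens, n)
        if not summary_ngrams:
            # Summary shorter than n tokens: report 0.0 rather than dividing by zero.
            ratios[n] = 0.0
            continue
        doc_ngrams = set(ngrams(doc_tokens, n))
        novel = sum(1 for gram in summary_ngrams if gram not in doc_ngrams)
        ratios[n] = novel / len(summary_ngrams)
    return ratios


if __name__ == "__main__":
    doc = "the cat sat on the mat near the door"
    summ = "the cat slept"
    print(compression_rate(doc, summ))
    print(novel_ngram_ratios(doc, summ))
```

For a real dataset these values would normally be averaged over all document-summary pairs, and a different tokenizer may shift the numbers noticeably, so the tokenizer used is worth stating next to the reported figures.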
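Dataset List
The following datasets have paragraph-length summaries as their output text:
| Dataset | Paper | Languages | Language Mode | Domain | Source of Supervision | Annotation Efforts | Availability | Subtask | Number of Samples | Human Evaluation |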
|---|---|---|---|---|---|---|---|---|---|---|
| DUC 2001-2007 | The Document Understanding Conference (DUC) | English | Monolingual | News | Dedicated | Human | Upon Request | Multidocument, Query-focused | 45 | - |
| MultiLing 2013 | Multi-document multilingual summarization and evaluation tracks in ACL 2013 MultiLing Workshop | Arabic, Czech, English, French, Modern Greek, Hebrew, Hindi, Chinese, Romanian, Spanish | Multilingual | News | Natural | Automatic | Upon Request | Multidocument | 150 | - |
| MultiLing 2015, 2017 | MultiLing 2017 Overview | Afrikaans, Arabic, Azerbaijani, Bulgarian, Bosnian, Catalan, Czech, German, Modern Greek, English, Esperanto, Spanish, Basque, Persian, Finnish, French, Croatian, Indonesian, Italian, Japanese, Javanese, Georgian, Korean, Limburgish, Latvian, Marathi, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Thai, Turkish, Tagalog, Ukrainian, Chinese | Multilingual | Encyclopedia | Distant | Automatic | Upon Request | - | 38 | - |
| New York Times Corpus | The New York Times Annotated Corpus | English | Monolingual | News | Dedicated | Human | Not Sure It Is Still Available | - | 650,000 | - |
| NEWSROOM | Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies | English | Monolingual | News | Distant | Automatic | URL-based reconstruction | - | 1,321,995 | - |
| DaNewsroom | DaNewsroom: A Large-scale Danish Summarisation Dataset | Danish | Monolingual | News | Distant | Automatic | URL-based reconstruction | - | 1,132,734 | - |
| Multi-News | Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model | English | Monolingual | News | Natural | Automatic | Publicly Available (License) | Multidocument Summarization | 250,000 | - |
| DACSA | DACSA: A large-scale Dataset for Automatic summarization of Catalan and Spanish newspaper Articles | Catalan, Spanish | Multilingual | News | Natural | Automatic | Upon Request | - | 2,845,833 | - |
| MENSA | Select and Summarize: Scene Saliency for Movie Script Summarization | English | Monolingual | Movie Scripts | Dedicated | Human | Upon Request | Scene Saliency | 1,000 | - |




