ccdv/pubmed-summarization

Name: ccdv/pubmed-summarization
Creator: ccdv
Published: 2022-10-24 20:33:04
License: no description provided

Source: Hugging Face. Updated 2022-10-24; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ccdv/pubmed-summarization

Resource overview:

This dataset is designed for long-document summarization and suits text summarization and text generation tasks. Each record has three fields: the article ID, the article body, and the abstract. The data is split into training, validation, and test sets. It is derived from a GitHub repository and is compatible with the `run_summarization.py` script in the HuggingFace Transformers library. The raw data is pre-tokenized, so the returned text is the tokens joined with spaces, with a newline character added between paragraphs.
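The joining step described above can be sketched as follows. This is a minimal illustration, not the dataset's actual loading script: the function names `join_pretokenized` and `split_paragraphs` and the nested-list input shape (a list of paragraphs, each a list of tokens) are assumptions made for the example.

```python
from typing import List


def join_pretokenized(paragraphs: List[List[str]]) -> str:
    """Join tokens with spaces and paragraphs with newlines.

    The input shape (list of paragraphs, each a list of tokens) is an
    assumption for illustration, not the documented raw format.
    """
    return "\n".join(" ".join(tokens) for tokens in paragraphs)


def split_paragraphs(article: str) -> List[str]:
    """Recover the paragraph strings from a joined article."""
    return article.split("\n")


# Invented toy input, not taken from the dataset.
example = [
    ["the", "patient", "was", "treated", "."],
    ["results", "are", "shown", "."],
]
article = join_pretokenized(example)
print(article)
```

Because paragraphs are separated by newlines, splitting on "\n" recovers paragraph boundaries, while tokens within a paragraph remain space-separated.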

Provider:

ccdv

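Original information summary

Dataset overview

Basic information

Language: English
Multilinguality: monolingual
Size: 100K<n<1M
Task categories:
- Summarization
- Text generation
Tags: conditional text generation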

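Dataset description

Name: PubMed dataset for summarization
Purpose: summarization of long documents
Original data source: GitHub repository
Data processing: the original data is pre-tokenized; the dataset returns " ".join(text) and adds "\n" between paragraphs
Compatibility: compatible with the run_summarization.py script; an entry must be added to the summarization_name_mapping variable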
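The compatibility note can be illustrated with a small sketch. It assumes that `summarization_name_mapping` is a dictionary from dataset name to a (text column, summary column) pair, with the column names taken from the fields in this listing; `resolve_columns` is a hypothetical helper written for this example and is not part of the script.

```python
from typing import Dict, Optional, Tuple

# Assumed shape: dataset name -> (text column, summary column).
summarization_name_mapping: Dict[str, Tuple[str, str]] = {
    # Entry for this dataset, using the field names from the listing.
    "ccdv/pubmed-summarization": ("article", "abstract"),
}


def resolve_columns(dataset_name: str) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: look up the column pair, or None if unmapped."""
    return summarization_name_mapping.get(dataset_name)


columns = resolve_columns("ccdv/pubmed-summarization")
if columns is not None:
    text_column, summary_column = columns
    print(f"text column: {text_column}, summary column: {summary_column}")
```

With such an entry in place, the text and summary columns do not have to be passed by hand each time the dataset is used.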

Data fields

id: paper ID
article: string containing the body of the paper
abstract: string containing the abstract of the paper
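The three fields can be modeled as a simple record type. The sketch below is illustrative only: `PubMedRecord` and `is_valid_record` are hypothetical names, treating `id` as a string is an assumption, and the sample values are invented placeholders rather than real dataset content.

```python
from typing import Any, Mapping, TypedDict


class PubMedRecord(TypedDict):
    """Hypothetical record type mirroring the three listed fields."""
    id: str        # paper ID (string type is an assumption)
    article: str   # body of the paper
    abstract: str  # abstract of the paper


REQUIRED_FIELDS = ("id", "article", "abstract")


def is_valid_record(record: Mapping[str, Any]) -> bool:
    """Check that every required field is present and is a string."""
    return all(isinstance(record.get(key), str) for key in REQUIRED_FIELDS)


# Invented placeholder record, not taken from the dataset.
sample: PubMedRecord = {
    "id": "example-0001",
    "article": "first paragraph .\nsecond paragraph .",
    "abstract": "a short summary .",
}
print(is_valid_record(sample))
```

A check like this is a cheap guard before tokenization, since a missing `article` or `abstract` value would otherwise surface much later in a training run.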

Data splits

Splits: training, validation, test
Statistics:

Split	Instances	Avg. words (article / abstract)
Training	119,924	3043 / 215
Validation	6,633	3111 / 216
Test	6,658	3092 / 219
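The split statistics above can be turned into split proportions and article-to-abstract length ratios. This sketch assumes that each "a / b" pair gives the average word count of the article followed by that of the abstract; the keys `train`, `validation`, and `test` and the helper names are labels chosen for this example.

```python
from typing import Dict, Tuple

# (instances, avg. article words, avg. abstract words), copied from the table.
# Reading "a / b" as article / abstract is an assumption.
SPLITS: Dict[str, Tuple[int, int, int]] = {
    "train": (119_924, 3043, 215),
    "validation": (6_633, 3111, 216),
    "test": (6_658, 3092, 219),
}


def total_instances(splits: Dict[str, Tuple[int, int, int]]) -> int:
    """Sum the instance counts over all splits."""
    return sum(count for count, _, _ in splits.values())


def length_ratio(article_words: int, abstract_words: int) -> float:
    """Average article length divided by average abstract length."""
    return article_words / abstract_words


total = total_instances(SPLITS)
print(f"total instances: {total}")  # 133215
for name, (count, article_words, abstract_words) in SPLITS.items():
    share = count / total
    ratio = length_ratio(article_words, abstract_words)
    print(f"{name}: {share:.1%} of instances, length ratio {ratio:.1f}")
```

Under that reading, articles are on average more than ten times longer than their abstracts in every split, which is what makes this a long-document summarization task.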

Citation

@inproceedings{cohan-etal-2018-discourse,
    title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
    author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans, Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-2097",
    doi = "10.18653/v1/N18-2097",
    pages = "615--621",
    abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
}

Collected summary

Dataset introduction

Background and challenges

Background overview

The 'ccdv/pubmed-summarization' dataset is a collection of research papers and their abstracts from PubMed, intended for training models in abstractive summarization. It features long documents with an average of over 3,000 tokens per article and includes splits for training, validation, and testing. The dataset is notable for its application in hierarchical encoder models that consider the discourse structure of documents for improved summarization.
