
ccdv/arxiv-summarization

Source: Hugging Face | Updated: 2022-12-08 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/ccdv/arxiv-summarization
Resource description:
---
language:
- en
multilinguality:
- monolingual
size_categories:
- 100K<n<1M
task_categories:
- summarization
- text-generation
task_ids: []
tags:
- conditional-text-generation
train-eval-index:
- config: document
  task: summarization
  task_id: summarization
  splits:
    eval_split: test
  col_mapping:
    article: text
    abstract: target
---

# Arxiv dataset for summarization

Dataset for summarization of long documents.\
Adapted from this [repo](https://github.com/armancohan/long-summarization).\
Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" for paragraphs.\
This dataset is compatible with the [`run_summarization.py`](https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization) script from Transformers if you add this line to the `summarization_name_mapping` variable:

```python
"ccdv/arxiv-summarization": ("article", "abstract")
```

### Data Fields

- `id`: paper id
- `article`: a string containing the body of the paper
- `abstract`: a string containing the abstract of the paper

### Data Splits

This dataset has 3 splits: _train_, _validation_, and _test_.\
Token counts are white space based.

| Dataset Split | Number of Instances | Avg. tokens (article / abstract) |
| ------------- | ------------------- |:---------------------------------|
| Train         | 203,037             | 6038 / 299                       |
| Validation    | 6,436               | 5894 / 172                       |
| Test          | 6,440               | 5905 / 174                       |

# Cite original article

```
@inproceedings{cohan-etal-2018-discourse,
    title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
    author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans, Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-2097",
    doi = "10.18653/v1/N18-2097",
    pages = "615--621",
    abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
}
```
Provided by:
ccdv
Summary of original information

Dataset overview

  • Language: English
  • Multilinguality: monolingual
  • Size: 100K<n<1M
  • Task categories: summarization, text generation
  • Tags: conditional text generation

Dataset details

  • Task: summarization
  • Train/eval index:
    • Config: document
    • Task ID: summarization
    • Splits:
      • Eval split: test
    • Column mapping:
      • article: text
      • abstract: target
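
The column mapping above pairs the dataset's `article` and `abstract` fields with the generic names `text` and `target`. As a rough sketch of what applying such a mapping means, the snippet below renames the keys of a single record held in a plain dictionary. The helper `apply_col_mapping` and the sample record are hypothetical and made up for illustration; they are not part of the dataset or of any library.

```python
from typing import Any, Dict

# Column mapping as listed in the card's train-eval-index metadata.
COL_MAPPING: Dict[str, str] = {"article": "text", "abstract": "target"}


def apply_col_mapping(
    record: Dict[str, Any], mapping: Dict[str, str] = COL_MAPPING
) -> Dict[str, Any]:
    """Return a copy of `record` with keys renamed per `mapping`.

    Keys that are not in the mapping (such as `id`) pass through unchanged.
    """
    return {mapping.get(key, key): value for key, value in record.items()}


# Made-up record for illustration only; not taken from the dataset.
sample = {
    "id": "example-id",
    "article": "body text of the paper",
    "abstract": "short summary",
}
renamed = apply_col_mapping(sample)
print(sorted(renamed))
```

When working with the Hugging Face `datasets` library, `Dataset.rename_columns` accepts a dictionary of this shape, so the same mapping could likely be reused there.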

Data fields

  • id: paper ID
  • article: a string containing the body of the paper
  • abstract: a string containing the abstract of the paper
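
The `article` and `abstract` fields are plain strings. According to the dataset card, the original data are pre-tokenized, and the strings are produced by joining tokens with spaces and separating paragraphs with "\n". The sketch below illustrates that reconstruction under one assumption: that each paragraph is available as a list of token strings. The function name `join_pretokenized` and the input layout are hypothetical and do not reproduce the dataset's actual loading script.

```python
from typing import List


def join_pretokenized(paragraphs: List[List[str]]) -> str:
    """Join tokens with single spaces and paragraphs with newlines."""
    return "\n".join(" ".join(tokens) for tokens in paragraphs)


# Made-up, pre-tokenized input: punctuation is already a separate token.
example = [
    ["we", "study", "long", "documents", "."],
    ["results", "are", "reported", "."],
]
article = join_pretokenized(example)
print(article)
```

Because the tokens are rejoined with spaces, punctuation stays separated from the preceding word, which is worth keeping in mind when comparing against detokenized text.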

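Data splits

  • Splits: train, validation, test
  • Number of instances and average whitespace-based token counts (article / abstract):
    • Train: 203,037 instances, avg. 6038 / 299 tokens
    • Validation: 6,436 instances, avg. 5894 / 172 tokens
    • Test: 6,440 instances, avg. 5905 / 174 tokens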
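
The dataset card states that token counts are whitespace-based. A minimal sketch of such a count, together with the total size and relative share of each split computed from the instance counts listed above, might look like this. The names `whitespace_token_count` and `SPLIT_SIZES` are illustrative assumptions, not identifiers from the dataset.

```python
from typing import Dict


def whitespace_token_count(text: str) -> int:
    """Count tokens by splitting on any run of whitespace (spaces, newlines)."""
    return len(text.split())


# Instance counts as listed for the three splits.
SPLIT_SIZES: Dict[str, int] = {
    "train": 203_037,
    "validation": 6_436,
    "test": 6_440,
}

total = sum(SPLIT_SIZES.values())
shares = {name: count / total for name, count in SPLIT_SIZES.items()}

for name, share in shares.items():
    print(f"{name}: {SPLIT_SIZES[name]} instances ({share:.1%} of {total})")
```

Since `str.split()` with no arguments treats newlines as whitespace, the "\n" paragraph separators in `article` do not add to the count.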


Collected summary
Dataset introduction
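Background and challenges
Background overview
This is a dataset of arXiv papers for long-document summarization, containing the body text and abstracts of more than 200,000 papers. The original data are pre-tokenized, and the dataset is compatible with the Transformers framework. It is divided into training, validation, and test sets, is mainly used to train and evaluate abstractive summarization models, and supports conditional text generation tasks.
The above content was collected and summarized by 遇见数据集.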