ArXiv English Paper Abstract Dataset
National Basic Science Data Center (国家基础学科公共科学数据中心), indexed 2025-12-06
Download link:
https://nbsdc.cn/general/dataDetail?id=6931b013195d2658bc1e5f87&type=1
Resource description:
The data come from the open arXiv data source. Paper metadata was retrieved through the official download interfaces and addresses provided by arXiv, and manually written programs then segmented the abstracts from multiple angles, filtered them, and stored the results. The dataset contains 100,000 multi-angle abstract records. Each record includes basic metadata such as the paper title and abstract, plus fine-grained abstracts from three angles: background, method, and results. The dataset targets tasks such as scientific literature understanding, academic abstract segmentation, and semantic learning for large language models. It can be used to train and evaluate models on multi-angle semantic recognition, representation learning, and fine-grained retrieval over scientific literature. For example, it can serve as a core training corpus for contrastive learning, automatic summarization, literature classification, information extraction, and question-document retrieval, providing data support for building language models capable of academic semantic understanding.
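To make the record layout described above more concrete, here is a minimal Python sketch of how one record might be represented and turned into query-document pairs for contrastive learning or retrieval. The field names (`title`, `abstract`, `background`, `method`, `result`), the `PaperRecord` class, and the `build_pairs` helper are all assumptions for illustration; the source does not specify the actual file format or column names, and the sample text is invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical schema: the description only says each record holds a title,
# an abstract, and background / method / result segments. The real field
# names and file format are not given in the source.
@dataclass
class PaperRecord:
    title: str
    abstract: str
    background: str
    method: str
    result: str


ASPECTS = ("background", "method", "result")


def build_pairs(record: PaperRecord) -> List[Tuple[str, str, str]]:
    """Return (aspect, query_text, document_text) triples for one paper.

    Each non-empty aspect segment acts as a query whose positive document
    is the full abstract of the same paper.
    """
    pairs = []
    for aspect in ASPECTS:
        segment = getattr(record, aspect).strip()
        if segment:  # skip aspects that the segmentation left empty
            pairs.append((aspect, segment, record.abstract))
    return pairs


# Invented sample record, for illustration only.
example = PaperRecord(
    title="An illustrative title",
    abstract="Prior work is limited. We propose X. X improves accuracy.",
    background="Prior work is limited.",
    method="We propose X.",
    result="",  # pretend the results segment is missing for this paper
)

for aspect, query, document in build_pairs(example):
    print(f"{aspect}: {query!r} -> {document!r}")
```

Pairing each segment with the full abstract is only one possible design; segments from different papers could equally serve as negatives in a contrastive setup.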
Provided by:
Beihang University
Collected summary
Dataset introduction

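Background and challenges
Background overview
The dataset contains 100,000 English paper abstracts obtained from arXiv. Programmatic processing produced, for each paper, the title, the abstract, and fine-grained abstracts from three angles: background, method, and results. It is suited to tasks such as scientific literature understanding, automatic summarization, and information extraction, and provides data support for training and evaluating the academic semantic understanding of language models.
The above content was collected and summarized by 遇见数据集.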



