arXiv Papers Dataset
arXiv, indexed 2025-09-30
Download link:
https://info.arxiv.org/help/bulk_data_s3.html
Description:
This dataset consists of excerpts resembling theorem environments and proofs, extracted from the PDFs of arXiv papers and annotated using the corresponding LaTeX source code. It provides ground-truth labels for theorem environments and proofs, and a dedicated LaTeX package was developed to produce the annotations. The training and validation splits do not overlap. The dataset contains approximately 500,000 samples drawn from 3,682 PDF articles. The target task is multimodal classification of paragraphs in academic articles into three classes: basic paragraphs, theorem paragraphs, and proof paragraphs.
Provided by:
arXiv



