arXiv Papers Dataset

Source: arXiv (indexed 2025-09-30)
Download link:
https://info.arxiv.org/help/bulk_data_s3.html
Description:
This dataset consists of excerpts resembling theorem environments and proofs, extracted from the PDFs of arXiv papers and annotated using the corresponding LaTeX source code. It provides ground-truth labels for theorem environments and proofs, and a dedicated LaTeX package was developed to produce these annotations. The validation and training splits do not overlap. The dataset contains approximately 500,000 samples drawn from 3,682 PDF articles. The target task is multimodal classification of paragraphs in academic articles into three classes: basic paragraphs, theorem paragraphs, and proof paragraphs.
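The description states that the training and validation splits do not overlap, but not how they were built. One common way to get that property is to split at the article level, so that no paper contributes samples to both sides. The sketch below illustrates that idea only. The record fields (`article_id`, `label`, `text`), the label names, and the function `split_by_article` are hypothetical and are not taken from the dataset itself.

```python
import random
from collections import defaultdict

# Hypothetical names for the three paragraph classes in the description.
LABELS = ("basic", "theorem", "proof")


def split_by_article(samples, val_fraction=0.2, seed=0):
    """Split samples so that no article appears in both splits.

    `samples` is a list of dicts with an assumed "article_id" key.
    """
    by_article = defaultdict(list)
    for sample in samples:
        by_article[sample["article_id"]].append(sample)

    # Shuffle article ids deterministically, then reserve a share for validation.
    article_ids = sorted(by_article)
    random.Random(seed).shuffle(article_ids)
    n_val = max(1, int(len(article_ids) * val_fraction))
    val_ids = set(article_ids[:n_val])

    train = [s for a in article_ids if a not in val_ids for s in by_article[a]]
    val = [s for a in article_ids if a in val_ids for s in by_article[a]]
    return train, val


# Toy data: 10 made-up articles with 5 paragraphs each.
samples = [
    {"article_id": f"paper-{i}", "label": LABELS[j % 3], "text": f"paragraph {j}"}
    for i in range(10)
    for j in range(5)
]
train, val = split_by_article(samples)
```

Splitting by article rather than by paragraph avoids near-duplicate leakage, since paragraphs from the same paper share notation and layout.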
Provider:
arXiv