unarXive

Name: unarXive
Creator: OpenDataLab
Published: 2026-05-24 07:30:28
License: No description available

OpenDataLab, updated 2026-05-24, indexed 2024-05-09

Download link:

https://opendatalab.org.cn/OpenDataLab/unarXive

Resource description:

An academic dataset containing publication full texts, annotated in-text citations, and links to metadata. The unarXive dataset comprises 1 million plain-text papers, 63 million citation contexts, 39 million reference strings, and a citation network with 16 million connections. The data is derived from all LaTeX sources on arXiv from 1991 to July 2020 and is therefore of higher quality than data generated from PDF files. Furthermore, because every citing paper is available in full text, citation contexts of arbitrary size can be extracted. Typical uses of the dataset include developing methods for citation recommendation, citation context analysis, and reference string parsing. The code for generating the dataset is publicly available.
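Extracting citation contexts of arbitrary size from a full-text paper could look roughly like the following minimal sketch. The `{{cite:...}}` marker pattern, the `citation_contexts` helper, and the character-based window are assumptions made for illustration; the actual annotation format in the dataset files may differ and should be checked before use.

```python
import re

# Assumed (hypothetical) in-text citation marker, e.g. "{{cite:ab12}}".
# Adjust the pattern to the marker format actually used in the data.
CITE_MARKER = re.compile(r"\{\{cite:([0-9a-fA-F-]+)\}\}")


def citation_contexts(text, window=100, marker=CITE_MARKER):
    """Return (cited_id, context) pairs for every marker found in `text`.

    Each context spans up to `window` characters on either side of the
    marker, so the context size is chosen freely by the caller.
    """
    contexts = []
    for match in marker.finditer(text):
        start = max(0, match.start() - window)       # clamp at text start
        end = min(len(text), match.end() + window)   # clamp at text end
        contexts.append((match.group(1), text[start:end]))
    return contexts


sample = "Prior work {{cite:ab12}} introduced the idea. We extend it here."
for cited_id, context in citation_contexts(sample, window=15):
    print(cited_id, repr(context))
```

A character window keeps the sketch short; a sentence-based window would need a sentence splitter, which is left out here.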

Provider:

OpenDataLab

Created:

2022-08-19

Collection summary

Dataset introduction