ArXiV Archive
Source: NIAID Data Ecosystem (indexed 2026-03-10)
Download link:
http://datadryad.org/dataset/doi%253A10.6078%252FD1708G
Description:
This is a full archive of metadata, including abstracts, for papers posted to arxiv.org from 1993 to 2018. The data is tidy and packed in TSV files, organized into two collections that each cover the full dataset: per year (all categories) and per category (all years). The archive also includes Jupyter notebooks for unpacking and analyzing the data in Python. See the README.md file and https://github.com/staeiou/arxiv_archive for more information.
Methods
Step 0: Query from arxiv.org
arXiv's main permitted means of bulk-downloading article metadata is its OAI-PMH API. I used the oai-harvest program to download the metadata; it stores the records as one XML file per paper, for a total of about 1.4 million files. These files are too large to be uploaded here.
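As an illustration of what an OAI-PMH harvest requests under the hood, here is a minimal sketch that only builds ListRecords URLs; it does not contact any server and is not how oai-harvest is implemented. The verb and argument names (ListRecords, metadataPrefix, from, until, resumptionToken) come from the OAI-PMH specification. The endpoint in BASE_URL, the "arXiv" metadata prefix, the date range, and the helper name list_records_url are assumptions made for the example, so check arXiv's current OAI-PMH documentation before using them.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration only; verify against arXiv's current docs.
BASE_URL = "https://export.arxiv.org/oai2"


def list_records_url(base_url, metadata_prefix="arXiv",
                     from_date=None, until_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # a follow-up request carries only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"


# First page of a harvest, then a follow-up page using a made-up token.
first_url = list_records_url(BASE_URL, from_date="1993-01-01", until_date="2018-12-31")
next_url = list_records_url(BASE_URL, resumption_token="TOKEN|1001")
print(first_url)
print(next_url)
```

A harvester repeats the second kind of request until the server stops returning a resumption token, which is why a full harvest of this size takes many requests.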
Step 1: Process XML files
In the Jupyter notebook 1-process-xml-files.ipynb, the individual XML files are processed into a single large Pandas DataFrame, which is stored in TSV and pickle formats. These files are too large to be uploaded here.
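The general shape of this step can be sketched as follows. This is not the notebook's code: the notebook uses Pandas, while this sketch sticks to the Python standard library so it runs anywhere. The XML layout, the field names in FIELDS, and the sample records are hypothetical and much simpler than real OAI-PMH records, which are namespaced.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified records standing in for the one-file-per-paper XML.
SAMPLE_RECORDS = [
    "<record><id>example-0001</id><created>2007-04-02</created>"
    "<title>An example\n  title</title><categories>cs.LG stat.ML</categories></record>",
    "<record><id>example-0002</id><created>2008-01-15</created>"
    "<title>Another example</title><categories>math.CO</categories></record>",
]

FIELDS = ["id", "created", "title", "categories"]  # assumed column names


def record_to_row(xml_text):
    """Parse one XML record into a flat dict, one key per column."""
    root = ET.fromstring(xml_text)
    row = {}
    for field in FIELDS:
        text = root.findtext(field, default="")
        # Collapse newlines and runs of spaces so each value fits one TSV cell.
        row[field] = " ".join(text.split())
    return row


def rows_to_tsv(rows):
    """Serialize rows to tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [record_to_row(xml_text) for xml_text in SAMPLE_RECORDS]
tsv_text = rows_to_tsv(rows)
print(tsv_text)
```

Normalizing whitespace before writing matters for TSV output, because a stray tab or newline inside a title or abstract would otherwise break the row structure.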
Step 2: Process categories and output to per_year and per_category TSVs
In the Jupyter notebook 2-process-categories-out.ipynb, the large TSV file created in step 1 is parsed and separated into two different batched outputs. The processed_data/per_year folder contains one TSV file per year, compressed in .zip format. The processed_data/per_category folder contains one TSV file per arXiv category, compressed in .xz format. arXiv papers have a primary category and may have secondary categories (posting and cross-posting); a paper appears in a category's dataset if it was either posted or cross-posted to that category.
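The membership rule for the per-category files can be sketched as below. This is an illustration, not the notebook's code: it assumes each row carries a space-separated categories string, and the rows, column names, and helper names are hypothetical. It uses the standard-library lzma module, whose default container format is .xz.

```python
import csv
import io
import lzma
from collections import defaultdict

# Hypothetical rows; "categories" is assumed to be a space-separated string.
ROWS = [
    {"id": "example-0001", "categories": "cs.LG stat.ML"},
    {"id": "example-0002", "categories": "stat.ML"},
    {"id": "example-0003", "categories": "math.CO cs.LG"},
]


def split_by_category(rows):
    """Place each row in every category it was posted or cross-posted to."""
    per_category = defaultdict(list)
    for row in rows:
        for category in row["categories"].split():
            per_category[category].append(row)
    return dict(per_category)


def to_xz_tsv(rows):
    """Serialize rows as TSV text and compress the bytes in .xz format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "categories"],
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return lzma.compress(buffer.getvalue().encode("utf-8"))


def from_xz_tsv(blob):
    """Decompress .xz bytes and parse the TSV back into dicts."""
    text = lzma.decompress(blob).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


per_category = split_by_category(ROWS)
cs_lg_rows = from_xz_tsv(to_xz_tsv(per_category["cs.LG"]))
print([row["id"] for row in cs_lg_rows])
```

Because cross-posted papers land in more than one category, the per-category files overlap, so concatenating them would count some papers more than once. The per-year files are the safer starting point for totals.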
Step 3: Export raw titles and abstracts
In the Jupyter notebook 3-abstracts-export.ipynb, the per_year datasets are unpacked and merged, then two sets of files are created: 1) abstracts only and 2) titles only, with one title or abstract per line. The notebook produces zipped files for all items (too large to upload to GitHub) and for a random sample of 250k items, which can be found in processed_data/DUMP_DATE/arxiv-abstracts-250k.txt.zip and processed_data/DUMP_DATE/arxiv-titles-250k.txt.zip.
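A minimal sketch of the one-item-per-line export and the random sample follows. It is an illustration under stated assumptions rather than the notebook's code: the placeholder abstracts, the helper names, and the fixed seed are all hypothetical, and the sample size is 2 here instead of 250k.

```python
import random

# Placeholder abstracts; real ones can contain newlines and tabs.
ABSTRACTS = [
    "First placeholder abstract,\nwrapped across lines.",
    "Second   placeholder abstract.",
    "Third placeholder\tabstract.",
    "Fourth placeholder abstract.",
]


def to_single_line(text):
    """Collapse all whitespace so the item occupies exactly one line."""
    return " ".join(text.split())


def export_lines(texts):
    """Join items into a text blob with one item per line."""
    return "\n".join(to_single_line(text) for text in texts) + "\n"


def sample_items(texts, k, seed=0):
    """Draw up to k items without replacement, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(texts, min(k, len(texts)))


all_lines = export_lines(ABSTRACTS)
sample_lines = export_lines(sample_items(ABSTRACTS, k=2))
print(all_lines)
print(sample_lines)
```

Flattening whitespace first is what keeps the line count equal to the item count, which downstream line-oriented tools rely on.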
Created:
2019-01-04



