
ArXiv OAI-PMH arXivRaw publication metadata

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/record/11065281
Description:
This dataset contains OAI-PMH metadata for all ArXiv publications up until 2024-04-23 in the arXivRaw XML format.

The metadata was harvested using the metha Go package v0.3.3 [1] on go1.18. Harvesting was run on a small HPC cluster using the SLURM script below. The script had to be scheduled twice because the connection was reset by the peer (see combined-slurm.out). metha handles such interruptions and can pick up where it left off through cumulative harvesting.

    #!/bin/bash
    #SBATCH --job-name=metha
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=1
    #SBATCH --ntasks=1
    #SBATCH --time=10-20:00:00

    module purge
    echo "Installing Go module."
    module add go/go-1.18/go-1.18-gcc-9.4.0-okbjyoy
    echo "Installed Go module: $(go version)."
    echo "Installing metha."
    go install -v github.com/miku/metha/cmd/...@latest
    echo "Installed metha: $(/go/bin/metha-sync -v)"
    echo "Harvesting ArXiv OAI-PMH metadata in format 'arXivRaw' from http://export.arxiv.org/oai2."
    /go/bin/metha-sync -T 5m -base-dir /scratch//arxiv -format "arXivRaw" http://export.arxiv.org/oai2
    # For the second run, '-from' was specified to pick up the harvest where it was left off.
    # /go/bin/metha-sync -from 2020-09-29 -T 5m -base-dir /scratch//arxiv -format "arXivRaw" http://export.arxiv.org/oai2
    echo "Done."
    exit 0

Dataset contents

This deposit of the dataset contains the following files:

- metha-output-OAI-PMH-arXivRaw-until-2024-03-24.tar.gz: an archive containing the gzipped files (*.xml.gz) produced by metha, which in turn contain the XML metadata. The gzipped files in the archive are named following the pattern YYYY-MM-DD-<8-digit zero-padded 0-indexed file count>.xml.gz, e.g., 2024-03-24-00000001.xml.gz.
- README.md: this file, containing basic information about the dataset and deposit.
- combined-slurm.out: the combined SLURM log for the two consecutive SLURM runs that produced the dataset. Run-specific information has been redacted.
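
The file naming pattern described above lends itself to a small consumer script. The following Python sketch is an illustration only, not part of the deposit: the function names `parse_filename` and `iter_identifiers` are hypothetical, and the code assumes that each *.xml.gz file holds a single well-formed XML document whose elements use the standard OAI-PMH 2.0 namespace. Both assumptions should be checked against the extracted files before relying on it.

```python
import gzip
import re
import xml.etree.ElementTree as ET
from datetime import date

# Pattern taken from the dataset description:
# YYYY-MM-DD-<8-digit zero-padded file count>.xml.gz
FILENAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-(\d{8})\.xml\.gz$")

# Standard OAI-PMH 2.0 namespace. Assumption: the harvested files use it.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_filename(name):
    """Return (date, file count) for a matching file name, else None.

    Raises ValueError if the name matches the pattern but the date
    part is not a valid calendar date.
    """
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return date.fromisoformat(match.group(1)), int(match.group(2))


def iter_identifiers(source):
    """Yield the OAI identifier of each record in one gzipped XML file.

    `source` may be a path or a binary file object. The file is parsed
    incrementally so that large files need not fit into memory at once.
    """
    with gzip.open(source, "rb") as handle:
        for _, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == OAI_NS + "identifier":
                yield elem.text
            elif elem.tag == OAI_NS + "record":
                # Drop the finished record's children to limit memory use.
                elem.clear()
```

After extracting the tar archive, `parse_filename` could be used to sort or filter the gzipped files by date, and `iter_identifiers` to count or list the records each file contains.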
Reproducibility

As the OAI-PMH metadata is not static but may change at any time, this dataset isn't fully reproducible. However, running the same metha version on the same Go version with the same commands should yield very similar results, although the output will contain newer metadata.

Licenses

All ArXiv OAI-PMH metadata is licensed under CC0-1.0. combined-slurm.out is licensed under CC0-1.0. README.md is licensed under CC0-1.0. Licenses are documented in a machine-readable manner following the REUSE 3.0 Specification. License deeds are included in this deposit as .txt files named using the respective SPDX license identifiers.

[1] Martin Czygan, Thomas Gersch, ACz-UniBi, Justin Kelly, Gunnar Þór Magnússon, dvglc, & Natanael Arndt. (2024). miku/metha: v0.3.3 (v0.3.3). Zenodo. doi:10.5281/zenodo.10940212.
Created:
2024-04-25