open-index/open-arxiv
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/open-index/open-arxiv
Resource description:
---
license: cc0-1.0
task_categories:
- text-classification
- text-generation
- feature-extraction
- question-answering
language:
- en
- multilingual
pretty_name: Open arXiv
size_categories:
- 1M<n<10M
tags:
- arxiv
- research-papers
- scientific-papers
- metadata
- academic
- citations
- open-access
source_datasets:
- Cornell-University/arxiv
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*/*.parquet
---
# Open arXiv
> Every arXiv paper's metadata in one place: search, filter, and explore 35 years of science
## What is it?
**Open arXiv** is the complete [arXiv](https://arxiv.org) metadata dataset, covering titles, abstracts, authors, categories, DOIs, version history, and more. It is converted from the [Cornell University Kaggle dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) into Parquet format for efficient querying and streaming.
The dataset contains **2.99M papers** spanning from 1991 to 2026, packaged into **417 Parquet shards** (Zstd compressed, **1.4 GB** total). All 14 fields from the original dataset are preserved with no modifications to the data content.
Released under the **CC0 1.0 Universal (Public Domain Dedication)** license, the same license as the original dataset.
## Dataset at a Glance
| Metric | Value |
|:---|:---|
| **Papers** | 2.99M |
| **Time span** | 1991 to 2026 |
| **Has DOI** | 43.4% |
| **Revised at least once** | 39.7% |
| **Average versions** | 1.6 |
| **Cross-listed in multiple categories** | 47.7% |
| **Parquet shards** | 417 |
| **Total size** | 1.4 GB |
## The Growth of arXiv
arXiv started in 1991 as a small physics preprint server at Los Alamos National Laboratory. Today it spans dozens of disciplines. Here is every year of submissions, from the first papers to hundreds of thousands per year:
```
1991 █ 306
1992 █ 3.3K
1993 █ 6.7K
1994 █ 10.1K
1995 █ 13.0K
1996 ██ 15.9K
1997 ██ 19.6K
1998 ███ 24.2K
1999 ███ 27.7K
2000 ████ 30.6K
2001 ████ 33.2K
2002 █████ 36.1K
2003 █████ 39.4K
2004 ██████ 43.7K
2005 ██████ 46.8K
2006 ███████ 50.2K
2007 ███████ 55.6K
2008 ████████ 58.9K
2009 █████████ 64.0K
2010 █████████ 70.1K
2011 ██████████ 76.6K
2012 ███████████ 84.6K
2013 █████████████ 92.6K
2014 █████████████ 97.5K
2015 ██████████████ 105.3K
2016 ███████████████ 113.4K
2017 █████████████████ 123.5K
2018 ███████████████████ 140.6K
2019 █████████████████████ 155.9K
2020 █████████████████████████ 178.3K
2021 █████████████████████████ 181.6K
2022 ██████████████████████████ 185.7K
2023 █████████████████████████████ 208.5K
2024 ██████████████████████████████████ 244.0K
2025 ████████████████████████████████████████ 284.5K
2026 █████████ 66.8K
```
<details>
<summary>SQL query to reproduce this chart</summary>
```sql
SELECT CASE
         WHEN POSITION('/' IN id) > 0 THEN
           CASE WHEN CAST(SUBSTR(SPLIT_PART(id, '/', 2), 1, 2) AS INT) >= 91
                THEN 1900 + CAST(SUBSTR(SPLIT_PART(id, '/', 2), 1, 2) AS INT)
                ELSE 2000 + CAST(SUBSTR(SPLIT_PART(id, '/', 2), 1, 2) AS INT)
           END
         ELSE 2000 + CAST(SUBSTR(id, 1, 2) AS INT)
       END AS submission_year,
       COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY submission_year
ORDER BY submission_year;
```
</details>
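The same year derivation can be written in Python when working with rows outside DuckDB. This is a minimal sketch that mirrors the SQL `CASE` expression above; `submission_year` is a hypothetical helper name, not part of the dataset or any library.

```python
def submission_year(arxiv_id: str) -> int:
    """Derive the submission year from an arXiv identifier.

    Old-style IDs look like "hep-th/9108001" (archive/YYMMNNN);
    new-style IDs look like "0704.0001" (YYMM.NNNN).
    """
    if "/" in arxiv_id:
        yy = int(arxiv_id.split("/", 1)[1][:2])
        # Same rule as the SQL: 91-99 belong to the 1900s, the rest to the 2000s.
        return 1900 + yy if yy >= 91 else 2000 + yy
    # New-style IDs start with a two-digit year in the 2000s.
    return 2000 + int(arxiv_id[:2])


if __name__ == "__main__":
    for example in ("hep-th/9108001", "0704.0001", "2301.12345"):
        print(example, submission_year(example))
```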
## What People Write About
The top 20 primary categories, showing where open science is thriving:
| # | Category | Papers | |
|---|----------|-------:|---|
| 1 | `cs.CV` | 143.9K | ████████████████████ |
| 2 | `hep-ph` | 141.6K | ███████████████████░ |
| 3 | `cs.LG` | 128.5K | █████████████████░░░ |
| 4 | `quant-ph` | 128.5K | █████████████████░░░ |
| 5 | `hep-th` | 112.2K | ███████████████░░░░░ |
| 6 | `astro-ph` | 94.2K | █████████████░░░░░░░ |
| 7 | `cs.CL` | 78.1K | ██████████░░░░░░░░░░ |
| 8 | `gr-qc` | 70.7K | █████████░░░░░░░░░░░ |
| 9 | `cond-mat.mtrl-sci` | 69.7K | █████████░░░░░░░░░░░ |
| 10 | `cond-mat.mes-hall` | 68.7K | █████████░░░░░░░░░░░ |
| 11 | `math.AP` | 56.6K | ███████░░░░░░░░░░░░░ |
| 12 | `astro-ph.GA` | 53.9K | ███████░░░░░░░░░░░░░ |
| 13 | `math.CO` | 52.3K | ███████░░░░░░░░░░░░░ |
| 14 | `cond-mat.str-el` | 52.2K | ███████░░░░░░░░░░░░░ |
| 15 | `astro-ph.SR` | 47.7K | ██████░░░░░░░░░░░░░░ |
| 16 | `astro-ph.HE` | 45.4K | ██████░░░░░░░░░░░░░░ |
| 17 | `astro-ph.CO` | 44.4K | ██████░░░░░░░░░░░░░░ |
| 18 | `math.PR` | 43.4K | ██████░░░░░░░░░░░░░░ |
| 19 | `cond-mat.stat-mech` | 43.3K | ██████░░░░░░░░░░░░░░ |
| 20 | `math.AG` | 39.6K | █████░░░░░░░░░░░░░░░ |
<details>
<summary>SQL query to reproduce this table</summary>
```sql
SELECT SPLIT_PART(categories, ' ', 1) AS primary_cat,
COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY primary_cat
ORDER BY papers DESC
LIMIT 20;
```
</details>
And grouped by top-level discipline:
| # | Discipline | Papers | |
|---|----------|-------:|---|
| 1 | `Computer Science` | 736.8K | ████████████████████ |
| 2 | `Mathematics` | 571.1K | ███████████████░░░░░ |
| 3 | `Condensed Matter` | 344.9K | █████████░░░░░░░░░░░ |
| 4 | `Astrophysics` | 332.3K | █████████░░░░░░░░░░░ |
| 5 | `Physics (general)` | 207.5K | █████░░░░░░░░░░░░░░░ |
| 6 | `High Energy Physics (ph)` | 141.6K | ███░░░░░░░░░░░░░░░░░ |
| 7 | `Quantum Physics` | 128.5K | ███░░░░░░░░░░░░░░░░░ |
| 8 | `High Energy Physics (th)` | 112.2K | ███░░░░░░░░░░░░░░░░░ |
| 9 | `General Relativity` | 70.7K | █░░░░░░░░░░░░░░░░░░░ |
| 10 | `Electrical Engineering` | 70.5K | █░░░░░░░░░░░░░░░░░░░ |
| 11 | `Statistics` | 60.7K | █░░░░░░░░░░░░░░░░░░░ |
| 12 | `Nuclear Physics (th)` | 35.4K | █░░░░░░░░░░░░░░░░░░░ |
| 13 | `Mathematical Physics` | 33.9K | █░░░░░░░░░░░░░░░░░░░ |
| 14 | `Quantitative Biology` | 33.7K | █░░░░░░░░░░░░░░░░░░░ |
| 15 | `High Energy Physics (exp)` | 24.8K | █░░░░░░░░░░░░░░░░░░░ |
<details>
<summary>SQL query to reproduce this table</summary>
```sql
SELECT CASE
         WHEN POSITION('.' IN SPLIT_PART(categories, ' ', 1)) > 0
         THEN SPLIT_PART(SPLIT_PART(categories, ' ', 1), '.', 1)
         ELSE SPLIT_PART(categories, ' ', 1)
       END AS area,
       COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY area
ORDER BY papers DESC
LIMIT 15;
```
</details>
47.7% of papers are cross-listed in more than one category.
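For row-level work in Python, the category logic used in the queries above can be sketched as a few small helpers. The function names are hypothetical, and the sketch assumes that "cross-listed" means the space-separated `categories` string holds more than one entry.

```python
def primary_category(categories: str) -> str:
    """Return the first-listed category, e.g. "math.CO" for "math.CO cs.CG"."""
    return categories.split()[0]


def top_level_area(categories: str) -> str:
    """Return the archive prefix of the primary category ("math", "cs", "hep-ph")."""
    return primary_category(categories).split(".", 1)[0]


def is_cross_listed(categories: str) -> bool:
    """True when the paper is listed in more than one category."""
    return len(categories.split()) > 1


if __name__ == "__main__":
    for cats in ("hep-ph", "math.CO cs.CG"):
        print(cats, "->", primary_category(cats), top_level_area(cats), is_cross_listed(cats))
```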
## Field Coverage
Not every paper fills in every metadata field. Here is how complete each optional field is across the full dataset:
| Field | Coverage | |
|-------|-------:|---|
| `submitter` | 99.5% | ██████████████░ |
| `authors` | 100.0% | ███████████████ |
| `comments` | 72.6% | ██████████░░░░░ |
| `journal_ref` | 31.4% | ████░░░░░░░░░░░ |
| `doi` | 43.4% | ██████░░░░░░░░░ |
| `license` | 84.9% | ████████████░░░ |
| `report_no` | 6.4% | █░░░░░░░░░░░░░░ |
<details>
<summary>SQL query to reproduce this table</summary>
```sql
SELECT
COUNT(*) AS total,
ROUND(100.0 * SUM(CASE WHEN submitter IS NOT NULL AND submitter != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS submitter_pct,
ROUND(100.0 * SUM(CASE WHEN authors IS NOT NULL AND authors != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS authors_pct,
ROUND(100.0 * SUM(CASE WHEN comments IS NOT NULL AND comments != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS comments_pct,
ROUND(100.0 * SUM(CASE WHEN journal_ref IS NOT NULL AND journal_ref != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS journal_ref_pct,
ROUND(100.0 * SUM(CASE WHEN doi IS NOT NULL AND doi != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS doi_pct,
ROUND(100.0 * SUM(CASE WHEN license IS NOT NULL AND license != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS license_pct,
ROUND(100.0 * SUM(CASE WHEN report_no IS NOT NULL AND report_no != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS report_no_pct
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet');
```
</details>
The required fields (`id`, `title`, `categories`, `abstract`, `versions`, `update_date`, `authors_parsed`) are present for every paper.
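The coverage figures can also be approximated on a sample of rows in plain Python. The sketch below applies the same rule as the SQL (a field counts as filled when it is neither null nor an empty string); `field_coverage` and `OPTIONAL_FIELDS` are illustrative names, and results on a sample will differ from the full-dataset numbers in the table.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional, Sequence

OPTIONAL_FIELDS = (
    "submitter", "authors", "comments", "journal_ref",
    "doi", "license", "report_no",
)


def field_coverage(
    rows: Iterable[Mapping[str, Optional[str]]],
    fields: Sequence[str] = OPTIONAL_FIELDS,
) -> dict:
    """Return the percentage of rows in which each field is non-null and non-empty."""
    filled = Counter()
    total = 0
    for row in rows:
        total += 1
        for name in fields:
            if row.get(name) not in (None, ""):
                filled[name] += 1
    if total == 0:
        return {name: 0.0 for name in fields}
    return {name: round(100.0 * filled[name] / total, 1) for name in fields}
```

Because the function accepts any iterable of dict-like rows, it can be fed a streamed dataset truncated with `itertools.islice` as well as an in-memory list.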
## Versions and Revisions
Most papers start at v1 and stay there, but a good number get updates. 39.7% of all papers have been revised at least once. On average, papers have **1.6** versions, and the most-revised paper reached **v187**.
<details>
<summary>SQL query to reproduce these stats</summary>
```sql
SELECT
ROUND(AVG(json_array_length(versions)), 2) AS avg_versions,
MAX(json_array_length(versions)) AS max_versions,
ROUND(100.0 * SUM(CASE WHEN json_array_length(versions) > 1 THEN 1 ELSE 0 END) / COUNT(*), 1) AS revised_pct
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet');
```
</details>
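A Python counterpart of the query above, as a rough sketch for use on a subset of rows: it decodes the JSON-encoded `versions` field and computes the same three statistics. `version_stats` is a hypothetical helper name.

```python
import json
from typing import Iterable, Mapping


def version_stats(rows: Iterable[Mapping[str, str]]) -> dict:
    """Compute average versions, maximum versions, and percent revised."""
    counts = [len(json.loads(row["versions"])) for row in rows]
    if not counts:
        raise ValueError("version_stats needs at least one row")
    revised = sum(1 for c in counts if c > 1)
    return {
        "avg_versions": round(sum(counts) / len(counts), 2),
        "max_versions": max(counts),
        "revised_pct": round(100.0 * revised / len(counts), 1),
    }
```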
## Licenses
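arXiv papers carry their own license terms. Here is the breakdown by distinct value of the `license` field. Labels that appear twice (`CC BY`, `CC BY-NC-SA`) are separate license URLs that share a short name, most likely different versions of the same license: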
| License | Papers | Share |
|---------|-------:|------:|
| arXiv non-exclusive | 1.83M | 61.3% |
| CC BY | 502.2K | 16.8% |
| Not specified | 452.8K | 15.1% |
| CC BY-NC-ND | 78.9K | 2.6% |
| CC BY-NC-SA | 59.3K | 2.0% |
| CC BY-SA | 27.2K | 0.9% |
| CC0 (Public Domain) | 20.0K | 0.7% |
| CC BY | 7.9K | 0.3% |
| CC BY-NC-SA | 5.9K | 0.2% |
| creativecommons.org/licenses/publicdo... | 2.5K | 0.1% |
<details>
<summary>SQL query to reproduce this table</summary>
```sql
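-- The alias differs from the column name so that GROUP BY is unambiguous
-- and NULL and empty-string licenses fall into the same group.
SELECT COALESCE(license, '') AS license_url,
       COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY license_url
ORDER BY papers DESC;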
```
</details>
## What is being released?
Every arXiv paper's metadata is stored as one row in the Parquet files, organized by submission month:
```
data/
  1991/
    1991-01.parquet
    1991-02.parquet
    ...
  2007/
    2007-04.parquet
    ...
  2026/
    2026-01.parquet
    2026-02.parquet
    2026-03.parquet
```
Each shard contains all papers submitted in that month. This layout keeps incremental updates cheap: when the upstream snapshot is refreshed each week, new submissions land only in the current month's shard, so historical shards need rewriting only when the metadata of older papers changes.
Each row includes the paper ID, title, abstract, authors (both raw and parsed), categories, DOI, journal reference, version history, and more. The `versions` and `authors_parsed` fields are stored as JSON strings for maximum compatibility.
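Because shards follow a predictable `data/YYYY/YYYY-MM.parquet` pattern, a single month can be fetched without touching the rest of the dataset. The sketch below builds the path from the layout shown above and passes it to `load_dataset` through `data_files`; `shard_path` and `load_month` are hypothetical helpers, and the `datasets` import is deferred so that the path logic works without the library installed.

```python
def shard_path(year: int, month: int) -> str:
    """Return the repository-relative path of one monthly shard."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    return f"data/{year:04d}/{year:04d}-{month:02d}.parquet"


def load_month(year: int, month: int):
    """Load a single month of papers (requires the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset(
        "open-index/open-arxiv",
        data_files=shard_path(year, month),
        split="train",
    )


if __name__ == "__main__":
    print(shard_path(2007, 4))
```

Calling `load_month(2026, 3)` would then download only that month's shard, assuming the file exists in the repository.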
## How to download and use Open arXiv
### Using `datasets`
```python
from datasets import load_dataset

# Stream the entire dataset
ds = load_dataset("open-index/open-arxiv", split="train", streaming=True)
for paper in ds:
    print(paper["id"], paper["title"])

# Load into memory
ds = load_dataset("open-index/open-arxiv", split="train")
print(f"{len(ds):,} papers loaded")
```
### Using `huggingface_hub`
```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "open-index/open-arxiv",
    repo_type="dataset",
    local_dir="./open-arxiv/",
)
```
For faster downloads, run `pip install huggingface_hub[hf_transfer]` and set the environment variable `HF_HUB_ENABLE_HF_TRANSFER=1`.
### Using DuckDB
```sql
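-- Count papers per top-level area of the primary (first-listed) category.
-- The inner split_part isolates the primary category so that cross-listed
-- papers such as "hep-ph cs.LG" are not grouped under "hep-ph cs".
SELECT split_part(split_part(categories, ' ', 1), '.', 1) AS area,
       COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY area
ORDER BY papers DESC
LIMIT 20;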
```
### Working with JSON fields
```python
import json
from datasets import load_dataset
ds = load_dataset("open-index/open-arxiv", split="train", streaming=True)

for paper in ds:
    # Parse structured authors: each entry starts with [last, first, suffix].
    # The starred target ignores the suffix and any further elements.
    authors = json.loads(paper["authors_parsed"])
    for last, first, *_ in authors:
        print(f"  {first} {last}")

    # Parse version history (ordered from v1 to the latest version)
    versions = json.loads(paper["versions"])
    print(f"  First submitted: {versions[0]['created']}")
    print(f"  Latest version:  v{len(versions)}")
```
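The `created` timestamps in `versions` follow the e-mail date format seen in the data instance (for example `Mon, 2 Apr 2007 19:18:42 GMT`), so the standard library can turn them into `datetime` objects. The following is a small sketch under that assumption; `version_dates` and `days_between_first_and_last` are hypothetical helper names.

```python
import json
from datetime import datetime
from email.utils import parsedate_to_datetime


def version_dates(versions_json: str) -> list:
    """Return (version label, timezone-aware datetime) pairs in stored order."""
    return [
        (entry["version"], parsedate_to_datetime(entry["created"]))
        for entry in json.loads(versions_json)
    ]


def days_between_first_and_last(versions_json: str) -> int:
    """Whole days elapsed between the first and the last listed version."""
    dates = [created for _, created in version_dates(versions_json)]
    return (dates[-1] - dates[0]).days
```

With parsed dates, questions such as "how long after submission was the paper last revised" reduce to ordinary `datetime` arithmetic.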
# Dataset card for Open arXiv
## Dataset Description
- **Homepage:** [https://huggingface.co/datasets/open-index/open-arxiv](https://huggingface.co/datasets/open-index/open-arxiv)
- **Source:** [Cornell-University/arxiv on Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv)
- **Point of Contact:** please create a discussion on the Community tab
- **License:** CC0 1.0 Universal (Public Domain Dedication)
## Dataset Structure
### Data Instance
```json
{
"id": "0704.0001",
"submitter": "Pavel Nadolsky",
"authors": "C. Balazs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
"title": "Calculation of prompt diphoton production cross sections at Tevatron and LHC energies",
"comments": "37 pages, 15 figures; published version",
"journal_ref": "Phys.Rev.D76:013009,2007",
"doi": "10.1103/PhysRevD.76.013009",
"report_no": "ANL-HEP-PR-07-12",
"categories": "hep-ph",
"license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
"abstract": " A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs...",
"versions": "[{\"version\": \"v1\", \"created\": \"Mon, 2 Apr 2007 19:18:42 GMT\"}, {\"version\": \"v2\", \"created\": \"Tue, 24 Jul 2007 20:10:27 GMT\"}]",
"update_date": "2008-11-26",
"authors_parsed": "[[\"Balazs\", \"C.\", \"\"], [\"Berger\", \"E. L.\", \"\"], [\"Nadolsky\", \"P. M.\", \"\"], [\"Yuan\", \"C. -P.\", \"\"]]"
}
```
### Data Fields
| Column | Type | Description |
|--------|------|-------------|
| `id` | string | arXiv paper ID (e.g. `"0704.0001"` or `"2301.12345"`) |
| `submitter` | string (nullable) | Person who submitted the paper |
| `authors` | string (nullable) | Raw author string exactly as submitted |
| `title` | string | Paper title |
| `comments` | string (nullable) | Free-form comments (e.g. "37 pages, 15 figures; published version") |
| `journal_ref` | string (nullable) | Journal citation (e.g. "Phys.Rev.D76:013009,2007") |
| `doi` | string (nullable) | Digital Object Identifier |
| `report_no` | string (nullable) | Institutional report number |
| `categories` | string | Space-separated arXiv categories (e.g. `"hep-ph"` or `"math.CO cs.CG"`) |
| `license` | string (nullable) | License URL for the paper content |
| `abstract` | string | Paper abstract |
| `versions` | string (JSON) | Version history array: `[{"version": "v1", "created": "..."}]` |
| `update_date` | string | Last metadata update date (YYYY-MM-DD) |
| `authors_parsed` | string (JSON) | Structured authors: `[["LastName", "FirstName", "Suffix"]]` |
### Data Splits
The dataset has a single `train` split containing all 2.99M papers.
## Dataset Creation
### Curation Rationale
The original arXiv metadata on Kaggle is distributed as a single large JSONL file (about 4-5 GB), which is inconvenient for streaming, SQL queries, and integration with modern ML tooling. **Open arXiv** converts this to columnar Parquet format with Zstd compression, enabling:
- **Streaming** via Hugging Face `datasets` without downloading the full file
- **SQL queries** via DuckDB directly from Hugging Face URLs
- **Column pruning** to load only the fields you need (e.g. just titles and abstracts)
- **Efficient filtering** by category, date, or any other field
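As an illustration of streaming combined with filtering, the sketch below keeps only papers listed in a given category and narrows each row to a few columns. `in_category`, `filter_by_category`, and `stream_titles` are hypothetical helpers. `select_columns` is the `datasets` method for restricting a streamed dataset to named columns; whether that also reduces the bytes read from each shard depends on the library version, so treat it as a convenience rather than a guaranteed optimization.

```python
from typing import Iterable, Iterator, Mapping


def in_category(row: Mapping[str, str], category: str) -> bool:
    """True if `category` is one of the paper's space-separated categories."""
    return category in row["categories"].split()


def filter_by_category(
    rows: Iterable[Mapping[str, str]], category: str
) -> Iterator[Mapping[str, str]]:
    """Lazily yield only the rows listed in `category`."""
    for row in rows:
        if in_category(row, category):
            yield row


def stream_titles(category: str, limit: int = 5) -> None:
    """Print the first `limit` matching titles (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily on purpose

    ds = load_dataset("open-index/open-arxiv", split="train", streaming=True)
    ds = ds.select_columns(["id", "title", "categories"])
    for i, row in enumerate(filter_by_category(ds, category)):
        if i >= limit:
            break
        print(row["id"], row["title"])
```

Matching on whole category tokens rather than substrings avoids false hits, for example `cs.CG` matching inside a longer code.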
### Source Data
The source is the [Cornell University arXiv dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) on Kaggle, which provides metadata for all papers on [arXiv.org](https://arxiv.org). The Kaggle dataset is updated weekly by Cornell University.
### Data Processing Steps
1. **Download** the dataset ZIP from Kaggle (about 1.7 GB compressed)
2. **Extract** the JSONL file (`arxiv-metadata-oai-snapshot.json`, about 4-5 GB)
3. **Convert** to Parquet shards using a streaming Go pipeline, where each line is parsed with `gjson` for zero-allocation field extraction, then written to Zstd-compressed Parquet via `parquet-go`
4. **Analyze** the Parquet shards with DuckDB to compute dataset statistics
5. **Upload** to Hugging Face via the `huggingface_hub` xet-aware uploader
### Field Name Changes
The original Kaggle dataset uses hyphenated field names (`journal-ref`, `report-no`). These are converted to underscores (`journal_ref`, `report_no`) for compatibility with Parquet column naming conventions and Python attribute access.
### Complex Fields
The `versions` and `authors_parsed` fields contain nested structures (arrays of objects/arrays) that cannot be directly represented in flat Parquet columns. They are stored as JSON strings. Use `json.loads()` in Python to parse them.
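The two conventions described above (underscore field names and JSON-encoded nested fields) can be sketched in Python as follows. This is an illustration only: the actual conversion is done by the Go pipeline, `to_flat_row` is a hypothetical helper, and the sketch assumes that source records carry `versions` and `authors_parsed` as nested JSON values.

```python
import json

# Hyphenated source names and their Parquet column names.
RENAMED = {"journal-ref": "journal_ref", "report-no": "report_no"}
# Nested fields that are stored as JSON strings.
JSON_FIELDS = ("versions", "authors_parsed")


def to_flat_row(record: dict) -> dict:
    """Rename hyphenated keys and serialize nested fields to JSON strings."""
    row = {}
    for key, value in record.items():
        name = RENAMED.get(key, key)
        if name in JSON_FIELDS:
            value = json.dumps(value)
        row[name] = value
    return row
```

Reading a row back is the inverse step: `json.loads(row["versions"])` restores the list of version objects.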
## Considerations for Using the Data
### Social Impact
By converting arXiv metadata to an accessible columnar format, we aim to lower the barrier for scientific text mining, citation analysis, and research trend studies. The dataset enables researchers to explore the full history of arXiv without needing Kaggle credentials or parsing large JSON files.
### Known Limitations
- This dataset contains **metadata only**. Paper PDFs and source files are not included. For full-text access, see [arXiv on Google Cloud Storage](https://cloud.google.com/storage/docs/public-datasets/arxiv).
- The `abstract` field may contain leading whitespace and LaTeX notation from the original submissions.
- Author disambiguation is not performed. Use the `authors_parsed` field for structured name access.
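For the whitespace issue noted above, a light normalization pass is usually enough before feeding abstracts to downstream tools. The sketch below collapses runs of whitespace (including line breaks) and trims the ends; it deliberately leaves LaTeX notation untouched. `clean_abstract` is a hypothetical helper name.

```python
import re

_WHITESPACE = re.compile(r"\s+")


def clean_abstract(abstract: str) -> str:
    """Collapse internal whitespace runs to single spaces and strip the ends."""
    return _WHITESPACE.sub(" ", abstract).strip()
```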
## Additional Information
### Licensing
The dataset is released under **CC0 1.0 Universal (Public Domain Dedication)**, the same license as the [original Kaggle dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv). Individual papers on arXiv have their own licenses; see the `license` field in each row.
### Contact
Please open a discussion on the [Community tab](https://huggingface.co/datasets/open-index/open-arxiv/discussions) for questions, feedback, or issues.
### Last Updated
2026-03-24
Provider:
open-index



