open-index/open-arxiv

Name: open-index/open-arxiv
Creator: open-index
Published: 2026-03-24 04:18:12
License: No description available

Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/open-index/open-arxiv

Description:

---
license: cc0-1.0
task_categories:
- text-classification
- text-generation
- feature-extraction
- question-answering
language:
- en
- multilingual
pretty_name: Open arXiv
size_categories:
- 1M<n<10M
tags:
- arxiv
- research-papers
- scientific-papers
- metadata
- academic
- citations
- open-access
source_datasets:
- Cornell-University/arxiv
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*/*.parquet
---

# Open arXiv

> Every arXiv paper's metadata in one place: search, filter, and explore 35 years of science

## What is it?

**Open arXiv** is the complete [arXiv](https://arxiv.org) metadata dataset, covering titles, abstracts, authors, categories, DOIs, version history, and more. It is converted from the [Cornell University Kaggle dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) into Parquet format for efficient querying and streaming.

The dataset contains **2.99M papers** spanning 1991 to 2026, packaged into **417 Parquet shards** (Zstd compressed, **1.4 GB** total). All 14 fields from the original dataset are preserved with no modifications to the data content.

Released under the **CC0 1.0 Universal (Public Domain Dedication)** license, the same license as the original dataset.

## Dataset at a Glance

| | |
|:---|:---|
| **Papers** | 2.99M |
| **Time span** | 1991 to 2026 |
| **Has DOI** | 43.4% |
| **Revised at least once** | 39.7% |
| **Average versions** | 1.6 |
| **Cross-listed in multiple categories** | 47.7% |
| **Parquet shards** | 417 |
| **Total size** | 1.4 GB |

## The Growth of arXiv

arXiv started in 1991 as a small physics preprint server at Los Alamos National Laboratory. Today it spans dozens of disciplines.
Here is every year of submissions, from the first papers to hundreds of thousands per year:

```
1991 █ 306
1992 █ 3.3K
1993 █ 6.7K
1994 █ 10.1K
1995 █ 13.0K
1996 ██ 15.9K
1997 ██ 19.6K
1998 ███ 24.2K
1999 ███ 27.7K
2000 ████ 30.6K
2001 ████ 33.2K
2002 █████ 36.1K
2003 █████ 39.4K
2004 ██████ 43.7K
2005 ██████ 46.8K
2006 ███████ 50.2K
2007 ███████ 55.6K
2008 ████████ 58.9K
2009 █████████ 64.0K
2010 █████████ 70.1K
2011 ██████████ 76.6K
2012 ███████████ 84.6K
2013 █████████████ 92.6K
2014 █████████████ 97.5K
2015 ██████████████ 105.3K
2016 ███████████████ 113.4K
2017 █████████████████ 123.5K
2018 ███████████████████ 140.6K
2019 █████████████████████ 155.9K
2020 █████████████████████████ 178.3K
2021 █████████████████████████ 181.6K
2022 ██████████████████████████ 185.7K
2023 █████████████████████████████ 208.5K
2024 ██████████████████████████████████ 244.0K
2025 ████████████████████████████████████████ 284.5K
2026 █████████ 66.8K
```

<details>
<summary>SQL query to reproduce this chart</summary>

```sql
SELECT
    CASE
        -- Old-style IDs such as "hep-th/9108001": the two-digit year follows the slash
        WHEN POSITION('/' IN id) > 0 THEN
            CASE
                WHEN CAST(SUBSTR(SPLIT_PART(id, '/', 2), 1, 2) AS INT) >= 91
                    THEN 1900 + CAST(SUBSTR(SPLIT_PART(id, '/', 2), 1, 2) AS INT)
                ELSE 2000 + CAST(SUBSTR(SPLIT_PART(id, '/', 2), 1, 2) AS INT)
            END
        -- New-style IDs such as "0704.0001": the two-digit year leads the ID
        ELSE 2000 + CAST(SUBSTR(id, 1, 2) AS INT)
    END AS submission_year,
    COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY submission_year
ORDER BY submission_year;
```

</details>

## What People Write About

The top 20 primary categories, showing where open science is thriving:

| # | Category | Papers | |
|---|----------|-------:|---|
| 1 | `cs.CV` | 143.9K | ████████████████████ |
| 2 | `hep-ph` | 141.6K | ███████████████████░ |
| 3 | `cs.LG` | 128.5K | █████████████████░░░ |
| 4 | `quant-ph` | 128.5K | █████████████████░░░ |
| 5 | `hep-th` | 112.2K | ███████████████░░░░░ |
| 6 | `astro-ph` | 94.2K | █████████████░░░░░░░ |
| 7 | `cs.CL` | 78.1K | ██████████░░░░░░░░░░ |
| 8 | `gr-qc` | 70.7K | █████████░░░░░░░░░░░ |
| 9 | `cond-mat.mtrl-sci` | 69.7K | █████████░░░░░░░░░░░ |
| 10 | `cond-mat.mes-hall` | 68.7K | █████████░░░░░░░░░░░ |
| 11 | `math.AP` | 56.6K | ███████░░░░░░░░░░░░░ |
| 12 | `astro-ph.GA` | 53.9K | ███████░░░░░░░░░░░░░ |
| 13 | `math.CO` | 52.3K | ███████░░░░░░░░░░░░░ |
| 14 | `cond-mat.str-el` | 52.2K | ███████░░░░░░░░░░░░░ |
| 15 | `astro-ph.SR` | 47.7K | ██████░░░░░░░░░░░░░░ |
| 16 | `astro-ph.HE` | 45.4K | ██████░░░░░░░░░░░░░░ |
| 17 | `astro-ph.CO` | 44.4K | ██████░░░░░░░░░░░░░░ |
| 18 | `math.PR` | 43.4K | ██████░░░░░░░░░░░░░░ |
| 19 | `cond-mat.stat-mech` | 43.3K | ██████░░░░░░░░░░░░░░ |
| 20 | `math.AG` | 39.6K | █████░░░░░░░░░░░░░░░ |

<details>
<summary>SQL query to reproduce this table</summary>

```sql
SELECT
    SPLIT_PART(categories, ' ', 1) AS primary_cat,
    COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY primary_cat
ORDER BY papers DESC
LIMIT 20;
```

</details>

And grouped by top-level discipline:

| # | Category | Papers | |
|---|----------|-------:|---|
| 1 | `Computer Science` | 736.8K | ████████████████████ |
| 2 | `Mathematics` | 571.1K | ███████████████░░░░░ |
| 3 | `Condensed Matter` | 344.9K | █████████░░░░░░░░░░░ |
| 4 | `Astrophysics` | 332.3K | █████████░░░░░░░░░░░ |
| 5 | `Physics (general)` | 207.5K | █████░░░░░░░░░░░░░░░ |
| 6 | `High Energy Physics (ph)` | 141.6K | ███░░░░░░░░░░░░░░░░░ |
| 7 | `Quantum Physics` | 128.5K | ███░░░░░░░░░░░░░░░░░ |
| 8 | `High Energy Physics (th)` | 112.2K | ███░░░░░░░░░░░░░░░░░ |
| 9 | `General Relativity` | 70.7K | █░░░░░░░░░░░░░░░░░░░ |
| 10 | `Electrical Engineering` | 70.5K | █░░░░░░░░░░░░░░░░░░░ |
| 11 | `Statistics` | 60.7K | █░░░░░░░░░░░░░░░░░░░ |
| 12 | `Nuclear Physics (th)` | 35.4K | █░░░░░░░░░░░░░░░░░░░ |
| 13 | `Mathematical Physics` | 33.9K | █░░░░░░░░░░░░░░░░░░░ |
| 14 | `Quantitative Biology` | 33.7K | █░░░░░░░░░░░░░░░░░░░ |
| 15 | `High Energy Physics (exp)` | 24.8K | █░░░░░░░░░░░░░░░░░░░ |

<details>
<summary>SQL query to reproduce this table</summary>

```sql
SELECT
    CASE
        WHEN POSITION('.' IN SPLIT_PART(categories, ' ', 1)) > 0
            THEN SPLIT_PART(SPLIT_PART(categories, ' ', 1), '.', 1)
        ELSE SPLIT_PART(categories, ' ', 1)
    END AS area,
    COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY area
ORDER BY papers DESC
LIMIT 15;
```

</details>

47.7% of papers are cross-listed in more than one category.

## Field Coverage

Not every paper fills in every metadata field. Here is how complete each optional field is across the full dataset:

| Field | Coverage | |
|-------|-------:|---|
| `submitter` | 99.5% | ██████████████░ |
| `authors` | 100.0% | ███████████████ |
| `comments` | 72.6% | ██████████░░░░░ |
| `journal_ref` | 31.4% | ████░░░░░░░░░░░ |
| `doi` | 43.4% | ██████░░░░░░░░░ |
| `license` | 84.9% | ████████████░░░ |
| `report_no` | 6.4% | █░░░░░░░░░░░░░░ |

<details>
<summary>SQL query to reproduce this table</summary>

```sql
SELECT
    COUNT(*) AS total,
    ROUND(100.0 * SUM(CASE WHEN submitter IS NOT NULL AND submitter != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS submitter_pct,
    ROUND(100.0 * SUM(CASE WHEN authors IS NOT NULL AND authors != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS authors_pct,
    ROUND(100.0 * SUM(CASE WHEN comments IS NOT NULL AND comments != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS comments_pct,
    ROUND(100.0 * SUM(CASE WHEN journal_ref IS NOT NULL AND journal_ref != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS journal_ref_pct,
    ROUND(100.0 * SUM(CASE WHEN doi IS NOT NULL AND doi != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS doi_pct,
    ROUND(100.0 * SUM(CASE WHEN license IS NOT NULL AND license != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS license_pct,
    ROUND(100.0 * SUM(CASE WHEN report_no IS NOT NULL AND report_no != '' THEN 1 ELSE 0 END) / COUNT(*), 1) AS report_no_pct
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet');
```

</details>

The required fields (`id`, `title`, `categories`, `abstract`, `versions`, `update_date`, `authors_parsed`) are present for every paper.

## Versions and Revisions

Most papers start at v1 and stay there, but a good number get updates. 39.7% of all papers have been revised at least once. On average, papers have **1.6** versions, and the most-revised paper reached **v187**.

<details>
<summary>SQL query to reproduce these stats</summary>

```sql
SELECT
    ROUND(AVG(json_array_length(versions)), 2) AS avg_versions,
    MAX(json_array_length(versions)) AS max_versions,
    ROUND(100.0 * SUM(CASE WHEN json_array_length(versions) > 1 THEN 1 ELSE 0 END) / COUNT(*), 1) AS revised_pct
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet');
```

</details>

## Licenses

arXiv papers carry their own license terms. Here is the breakdown:

| License | Papers | Share |
|---------|-------:|------:|
| arXiv non-exclusive | 1.83M | 61.3% |
| CC BY | 502.2K | 16.8% |
| Not specified | 452.8K | 15.1% |
| CC BY-NC-ND | 78.9K | 2.6% |
| CC BY-NC-SA | 59.3K | 2.0% |
| CC BY-SA | 27.2K | 0.9% |
| CC0 (Public Domain) | 20.0K | 0.7% |
| CC BY | 7.9K | 0.3% |
| CC BY-NC-SA | 5.9K | 0.2% |
| creativecommons.org/licenses/publicdo... | 2.5K | 0.1% |

<details>
<summary>SQL query to reproduce this table</summary>

```sql
SELECT
    COALESCE(license, '') AS license,
    COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY license
ORDER BY papers DESC;
```

</details>

## What is being released?

Every arXiv paper's metadata is stored as one row in the Parquet files, organized by submission month:

```
data/
  1991/
    1991-01.parquet
    1991-02.parquet
    ...
  2007/
    2007-04.parquet
    ...
  2026/
    2026-01.parquet
    2026-02.parquet
    2026-03.parquet
```

Each shard contains all papers submitted in that month. This layout makes incremental updates efficient: when arXiv is updated weekly, only the current month's shard changes while all historical months stay the same.
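
Because each month lives in its own shard, a date range can be loaded without touching the rest of the dataset. The following is a minimal sketch, not part of the official card: `shard_path`, `shard_paths`, and `load_months` are hypothetical helper names, and the code assumes that every month in the requested range has a shard following the `data/YYYY/YYYY-MM.parquet` layout shown above.

```python
from datetime import date


def shard_path(year: int, month: int) -> str:
    """Return the repo-relative path of the shard for one submission month."""
    date(year, month, 1)  # raises ValueError for an invalid year/month
    return f"data/{year:04d}/{year:04d}-{month:02d}.parquet"


def shard_paths(start: tuple[int, int], end: tuple[int, int]) -> list[str]:
    """List shard paths for every month from start to end, inclusive."""
    paths = []
    year, month = start
    while (year, month) <= end:
        paths.append(shard_path(year, month))
        # Roll over to January of the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


def load_months(start: tuple[int, int], end: tuple[int, int]):
    """Load only the selected shards (needs network access and `datasets`)."""
    # Imported lazily so the path helpers above also work offline
    from datasets import load_dataset

    return load_dataset(
        "open-index/open-arxiv",
        data_files=shard_paths(start, end),  # assumption: all listed shards exist
        split="train",
    )
```

For example, `load_months((2025, 11), (2026, 2))` would request four shards rather than all 417.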

Each row includes the paper ID, title, abstract, authors (both raw and parsed), categories, DOI, journal reference, version history, and more. The `versions` and `authors_parsed` fields are stored as JSON strings for maximum compatibility.

## How to download and use Open arXiv

### Using `datasets`

```python
from datasets import load_dataset

# Stream the entire dataset
ds = load_dataset("open-index/open-arxiv", split="train", streaming=True)
for paper in ds:
    print(paper["id"], paper["title"])

# Load into memory
ds = load_dataset("open-index/open-arxiv", split="train")
print(f"{len(ds):,} papers loaded")
```

### Using `huggingface_hub`

```python
from huggingface_hub import snapshot_download

folder = snapshot_download(
    "open-index/open-arxiv",
    repo_type="dataset",
    local_dir="./open-arxiv/",
)
```

For faster downloads, run `pip install huggingface_hub[hf_transfer]` and set `HF_HUB_ENABLE_HF_TRANSFER=1`.

### Using DuckDB

```sql
-- Count papers per top-level area of the primary (first-listed) category.
-- The inner split_part isolates the primary category so that cross-listed
-- rows such as "hep-ph gr-qc" are not counted as their own area.
SELECT
    split_part(split_part(categories, ' ', 1), '.', 1) AS area,
    COUNT(*) AS papers
FROM read_parquet('hf://datasets/open-index/open-arxiv/data/*/*.parquet')
GROUP BY area
ORDER BY papers DESC
LIMIT 20;
```

### Working with JSON fields

```python
import json
from datasets import load_dataset

ds = load_dataset("open-index/open-arxiv", split="train", streaming=True)
for paper in ds:
    # Parse structured authors: each entry starts with [last, first, suffix];
    # *_ tolerates any extra trailing elements
    authors = json.loads(paper["authors_parsed"])
    for last, first, *_ in authors:
        print(f"  {first} {last}")

    # Parse version history
    versions = json.loads(paper["versions"])
    print(f"  First submitted: {versions[0]['created']}")
    print(f"  Latest version: v{len(versions)}")
```

# Dataset card for Open arXiv

## Dataset Description

- **Homepage:** [https://huggingface.co/datasets/open-index/open-arxiv](https://huggingface.co/datasets/open-index/open-arxiv)
- **Source:** [Cornell-University/arxiv on Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv)
- **Point of Contact:** please create a discussion on the Community tab
- **License:** CC0 1.0 Universal (Public Domain Dedication)

## Dataset Structure

### Data Instance

```json
{
  "id": "0704.0001",
  "submitter": "Pavel Nadolsky",
  "authors": "C. Balazs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan",
  "title": "Calculation of prompt diphoton production cross sections at Tevatron and LHC energies",
  "comments": "37 pages, 15 figures; published version",
  "journal_ref": "Phys.Rev.D76:013009,2007",
  "doi": "10.1103/PhysRevD.76.013009",
  "report_no": "ANL-HEP-PR-07-12",
  "categories": "hep-ph",
  "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
  "abstract": "  A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs...",
  "versions": "[{\"version\": \"v1\", \"created\": \"Mon, 2 Apr 2007 19:18:42 GMT\"}, {\"version\": \"v2\", \"created\": \"Tue, 24 Jul 2007 20:10:27 GMT\"}]",
  "update_date": "2008-11-26",
  "authors_parsed": "[[\"Balazs\", \"C.\", \"\"], [\"Berger\", \"E. L.\", \"\"], [\"Nadolsky\", \"P. M.\", \"\"], [\"Yuan\", \"C. -P.\", \"\"]]"
}
```

### Data Fields

| Column | Type | Description |
|--------|------|-------------|
| `id` | string | arXiv paper ID (e.g. `"0704.0001"` or `"2301.12345"`) |
| `submitter` | string (nullable) | Person who submitted the paper |
| `authors` | string (nullable) | Raw author string exactly as submitted |
| `title` | string | Paper title |
| `comments` | string (nullable) | Free-form comments (e.g. "37 pages, 15 figures; published version") |
| `journal_ref` | string (nullable) | Journal citation (e.g. "Phys.Rev.D76:013009,2007") |
| `doi` | string (nullable) | Digital Object Identifier |
| `report_no` | string (nullable) | Institutional report number |
| `categories` | string | Space-separated arXiv categories (e.g. `"hep-ph"` or `"math.CO cs.CG"`) |
| `license` | string (nullable) | License URL for the paper content |
| `abstract` | string | Paper abstract |
| `versions` | string (JSON) | Version history array: `[{"version": "v1", "created": "..."}]` |
| `update_date` | string | Last metadata update date (YYYY-MM-DD) |
| `authors_parsed` | string (JSON) | Structured authors: `[["LastName", "FirstName", "Suffix"]]` |

### Data Splits

The dataset has a single `train` split containing all 2.99M papers.

## Dataset Creation

### Curation Rationale

The original arXiv metadata on Kaggle is distributed as a single large JSONL file (about 4-5 GB), which is inconvenient for streaming, SQL queries, and integration with modern ML tooling. **Open arXiv** converts this to columnar Parquet format with Zstd compression, enabling:

- **Streaming** via HuggingFace `datasets` without downloading the full file
- **SQL queries** via DuckDB directly from HuggingFace URLs
- **Column pruning** to load only the fields you need (e.g. just titles and abstracts)
- **Efficient filtering** by category, date, or any other field

### Source Data

The source is the [Cornell University arXiv dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv) on Kaggle, which provides metadata for all papers on [arXiv.org](https://arxiv.org). The Kaggle dataset is updated weekly by Cornell University.

### Data Processing Steps

1. **Download** the dataset ZIP from Kaggle (about 1.7 GB compressed)
2. **Extract** the JSONL file (`arxiv-metadata-oai-snapshot.json`, about 4-5 GB)
3. **Convert** to Parquet shards using a streaming Go pipeline, where each line is parsed with `gjson` for zero-allocation field extraction, then written to Zstd-compressed Parquet via `parquet-go`
4. **Analyze** the Parquet shards with DuckDB to compute dataset statistics
5. **Upload** to HuggingFace via the `huggingface_hub` xet-aware uploader

### Field Name Changes

The original Kaggle dataset uses hyphenated field names (`journal-ref`, `report-no`). These are converted to underscores (`journal_ref`, `report_no`) for compatibility with Parquet column naming conventions and Python attribute access.

### Complex Fields

The `versions` and `authors_parsed` fields contain nested structures (arrays of objects/arrays) that cannot be directly represented in flat Parquet columns. They are stored as JSON strings. Use `json.loads()` in Python to parse them.

## Considerations for Using the Data

### Social Impact

By converting arXiv metadata to an accessible columnar format, we aim to lower the barrier for scientific text mining, citation analysis, and research trend studies. The dataset enables researchers to explore the full history of arXiv without needing Kaggle credentials or parsing large JSON files.

### Known Limitations

- This dataset contains **metadata only**. Paper PDFs and source files are not included. For full-text access, see [arXiv on Google Cloud Storage](https://cloud.google.com/storage/docs/public-datasets/arxiv).
- The `abstract` field may contain leading whitespace and LaTeX notation from the original submissions.
- Author disambiguation is not performed. Use the `authors_parsed` field for structured name access.

## Additional Information

### Licensing

The dataset is released under **CC0 1.0 Universal (Public Domain Dedication)**, the same license as the [original Kaggle dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv). Individual papers on arXiv have their own licenses; see the `license` field in each row.

### Contact

Please open a discussion on the [Community tab](https://huggingface.co/datasets/open-index/open-arxiv/discussions) for questions, feedback, or issues.

### Last Updated

2026-03-24

Provider:

open-index
