Vidushee/ArXiv-Papers-150K
Source: Hugging Face. Updated 2026-03-24; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/Vidushee/ArXiv-Papers-150K
Description:
---
license: cc-by-nc-4.0
task_categories:
- text-generation
- document-question-answering
language:
- en
pretty_name: ArXiv Papers 150K
size_categories:
- 100K<n<1M
tags:
- arxiv
- latex
- scientific-papers
- machine-learning
- computer-vision
- nlp
- research-papers
- llm
- rl
- latex-source
- ocr
- arxiv-ml-papers
viewer: false
---
# ArXiv-Papers-150K
150K+ ArXiv papers as raw LaTeX source archives, covering major AI/ML conferences (2016--2026).
## Overview
| | |
|---|---|
| **Papers** | 150,334 |
| **Size** | ~285 GB |
| **Format** | `.tar.gz` per paper (original ArXiv source) |
| **Years** | 2016 -- 2026 |
| **Categories** | cs.LG, cs.CV, cs.CL, cs.AI, stat.ML, cs.NE, cs.SD, eess.AS, cs.RO |
## Category Breakdown
| Category | Papers | Description |
|---|---|---|
| cs.LG | 54,200 | Machine Learning (ICML, NeurIPS, ICLR) |
| cs.CV | 35,000 | Computer Vision (CVPR, ECCV, ICCV) |
| cs.CL | 25,000 | Computation & Language (ACL, EMNLP) |
| cs.AI | 20,000 | Artificial Intelligence (AAAI, IJCAI) |
| stat.ML | 7,803 | Statistical Machine Learning |
| cs.NE | 2,500 | Neural & Evolutionary Computing (IJCNN) |
| cs.RO | 2,500 | Robotics |
| cs.SD | 1,500 | Sound |
| eess.AS | 1,500 | Audio & Speech |
## Dataset Structure
```
repo/
  metadata.parquet          # Paper metadata (title, abstract, authors, categories, year, chunk_file)
  cs.LG/
    cs.LG_part_000.tar      # ~10GB each, contains individual paper .tar.gz files
    cs.LG_part_001.tar
    ...
  cs.CV/
    cs.CV_part_000.tar
    ...
```
Each `.tar` chunk contains individual paper archives (`.tar.gz`). Each paper archive contains the LaTeX source as submitted to ArXiv: `.tex` files, figures, `.bib` references, style files, etc.
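Because the archives are nested, the `.tex` files can also be read without unpacking a whole chunk to disk. The following is a minimal sketch, assuming the outer chunk is an uncompressed tar whose members end in `.tar.gz`; the helper name `iter_tex_files` is hypothetical and not part of the dataset.

```python
import tarfile


def iter_tex_files(chunk_path):
    """Yield (paper_archive_name, tex_member_name, tex_bytes) from one chunk .tar.

    Hypothetical helper: assumes every paper inside the chunk is stored as a
    gzipped tarball whose name ends in ".tar.gz".
    """
    with tarfile.open(chunk_path, "r:") as outer:
        for member in outer:
            if not (member.isfile() and member.name.endswith(".tar.gz")):
                continue
            inner_file = outer.extractfile(member)
            try:
                # Open the paper archive directly from the outer tar's stream.
                with tarfile.open(fileobj=inner_file, mode="r:gz") as inner:
                    for item in inner:
                        if item.isfile() and item.name.endswith(".tex"):
                            yield member.name, item.name, inner.extractfile(item).read()
            except (tarfile.TarError, OSError, EOFError):
                # Skip members that are not readable gzipped tarballs.
                continue
```

Unreadable members are skipped rather than raising, so one damaged paper archive does not stop a pass over a ~10GB chunk.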
## Metadata
`metadata.parquet` contains the following columns for all 150K papers:
| Column | Type | Description |
|---|---|---|
| `paper_id` | string | ArXiv ID (e.g., `2401.12345`) |
| `title` | string | Paper title |
| `abstract` | string | Abstract |
| `authors` | string | Comma-separated author names |
| `categories` | string | All ArXiv categories |
| `primary_category` | string | Primary ArXiv category |
| `published` | string | Publication date |
| `year` | int | Year extracted from ArXiv ID |
| `chunk_file` | string | Which tar chunk contains this paper |
| `doi` | string | DOI if available |
| `journal_ref` | string | Journal reference if available |
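Since each chunk is roughly 10GB, it can help to group the papers you want by `chunk_file` first, so that every chunk is downloaded only once. A minimal sketch using only the `paper_id` and `chunk_file` columns; the function name `group_by_chunk` is hypothetical, and the exact form of the `chunk_file` values (bare file name or path within the repo) is an assumption to check against the real metadata.

```python
from collections import defaultdict


def group_by_chunk(rows, wanted_ids):
    """Map chunk_file -> sorted paper IDs, restricted to wanted_ids.

    rows: iterable of (paper_id, chunk_file) pairs, for example
          zip(df["paper_id"], df["chunk_file"]) on the metadata DataFrame.
    """
    wanted = set(wanted_ids)
    plan = defaultdict(list)
    for paper_id, chunk_file in rows:
        if paper_id in wanted:
            plan[chunk_file].append(paper_id)
    # Sort chunks and IDs so the download plan is deterministic.
    return {chunk: sorted(ids) for chunk, ids in sorted(plan.items())}
```

The keys of the returned dictionary are the chunks to fetch, and the values are the paper IDs to pull out of each one.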
## Quick Start
**Load metadata:**
```python
import pandas as pd
df = pd.read_parquet("hf://datasets/Vidushee/ArXiv-Papers-150K/metadata.parquet")
print(df.shape) # (150334, 12)
```
**Download and extract a category:**
```python
from huggingface_hub import hf_hub_download
import tarfile

# Download one chunk
path = hf_hub_download(
    repo_id="Vidushee/ArXiv-Papers-150K",
    filename="cs.LG/cs.LG_part_000.tar",
    repo_type="dataset",
)

# Extract the per-paper archives from the chunk
with tarfile.open(path) as tar:
    tar.extractall("./cs_LG_papers/")
# Each extracted file is one paper's source archive

# Extract a single paper (example ID; use one that is actually in the chunk)
paper = "./cs_LG_papers/2401.12345.tar.gz"
with tarfile.open(paper) as t:
    t.extractall("./paper_source/")
# Now you have: main.tex, figures/, references.bib, etc.
```
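The top-level file is not necessarily called `main.tex`, since authors name their sources freely. One common heuristic, sketched here as an assumption rather than something the dataset provides, is to look for the `.tex` files that contain `\documentclass`; the helper name `find_main_tex` is hypothetical.

```python
from pathlib import Path


def find_main_tex(source_dir):
    """Return .tex files under source_dir that contain a documentclass command.

    Heuristic only: a commented-out documentclass line also matches.
    Results are ordered with the shallowest paths first.
    """
    candidates = []
    for tex in Path(source_dir).rglob("*.tex"):
        # Sources use mixed encodings, so undecodable bytes are ignored.
        text = tex.read_text(encoding="utf-8", errors="ignore")
        if "\\documentclass" in text:
            candidates.append(tex)
    return sorted(candidates, key=lambda p: (len(p.parts), p.name))
```

For example, `find_main_tex("./paper_source/")` returns a list of candidate paths; an empty list means no file declared a document class.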
## Use Cases
- LLM pre-training on scientific text
- Scientific document understanding
- LaTeX generation and completion
- Citation network analysis
- Figure extraction and captioning
## Source
All papers sourced from [arxiv.org](https://arxiv.org) via the `/e-print/` endpoint. Metadata from [anonymousatom/arxiv-metadata](https://huggingface.co/datasets/anonymousatom/arxiv-metadata). Papers distributed under their original ArXiv licenses.
## Citation
```bibtex
@dataset{arxiv_papers_150k,
title={ArXiv-Papers-150K: LaTeX Source Archives for 150K AI/ML Papers},
year={2026},
url={https://huggingface.co/datasets/Vidushee/ArXiv-Papers-150K},
note={150K ArXiv paper source archives covering cs.LG, cs.CV, cs.CL, cs.AI, stat.ML and related categories (2016-2026)}
}
```
Provider: Vidushee



