permutans/arxiv-papers-by-subject
Hugging Face | Updated 2025-12-21 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/permutans/arxiv-papers-by-subject
Resource description:
---
license: mit
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- arxiv
- academic-papers
- scientific-literature
- research
- metadata
size_categories:
- 1M<n<10M
source_datasets:
- nick007x/arxiv-papers
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/**/*.parquet"
---
# arXiv Papers by Subject
A reorganised version of the [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers) dataset, partitioned by subject code, year, and month for efficient selective access.
## Dataset Description
This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset.
### Motivation
The original [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers) dataset is an excellent resource containing comprehensive arXiv paper metadata. However, its monolithic structure requires downloading the entire dataset even when only a subset of papers is needed.
This derived dataset addresses that limitation by partitioning the data into small, focused parquet files organised by:
1. **Subject code** (e.g., `cs.AI`, `astro-ph.CO`, `math.NA`)
2. **Year** (1989–2025)
3. **Month** (01–12)
This structure enables:
- Downloading only specific research domains
- Fetching data for particular time ranges
- Incremental updates as new papers are published
- Efficient caching and lazy loading
## Dataset Structure
```
data/
├── astro-ph.CO/
│   ├── 2009/
│   │   ├── 01/
│   │   │   └── 00000000.parquet
│   │   ├── 02/
│   │   │   └── 00000000.parquet
│   │   └── ...
│   └── ...
├── cs.AI/
│   ├── 1993/
│   │   └── ...
│   └── 2025/
│       └── ...
├── cs.LG/
│   └── ...
└── ...
```
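The layout above maps a (subject, year, month) triple onto one file path. A minimal sketch of that mapping follows. It assumes every partition holds a single file named `00000000.parquet`, as in the tree shown; the helper name `partition_path` is hypothetical and not part of the dataset or any library.

```python
def partition_path(subject: str, year: int, month: int) -> str:
    """Build the repo-relative path of one subject/year/month partition.

    Assumes a single file per partition named 00000000.parquet, as in the
    directory tree above; partitions split into several files would need
    a different rule.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    # Year is written with four digits and month with two, zero-padded.
    return f"data/{subject}/{year:04d}/{month:02d}/00000000.parquet"


print(partition_path("cs.LG", 2024, 6))
# data/cs.LG/2024/06/00000000.parquet
```

The returned string can be passed as the `filename` argument of `hf_hub_download`.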
### Subject Categories
The dataset includes 148 arXiv subject categories spanning:
| Domain | Category patterns (count) |
|--------|---------------------------|
| Astrophysics | `astro-ph.*` × 6 |
| Condensed Matter | `cond-mat.*` × 9 |
| Computer Science | `cs.*` × 60 |
| Economics | `econ.*` × 3 |
| Electrical Engineering | `eess.*` × 4 |
| Mathematics | `math.*` × 30 |
| Physics | `gr-qc`, `hep-*` × 4, `nucl-*` × 2, `quant-ph`, `physics.*` × 22 |
| Quantitative Biology | `q-bio.*` × 10 |
| Quantitative Finance | `q-fin.*` × 8 |
| Statistics | `stat.*` × 5 |
| Nonlinear Sciences | `nlin.*` × 5 |
### Data Fields
Each parquet file contains the following fields (inherited from the source dataset):
| Field | Type | Description |
|-------|------|-------------|
| `arxiv_id` | string | Unique arXiv identifier (e.g., `2301.00001`) |
| `title` | string | Paper title |
| `authors` | list[string] | List of author names |
| `submission_date` | string | Date of submission (e.g., `18 Feb 2009`) |
| `comments` | string | Author comments (page count, figures, etc.) |
| `primary_subject` | string | Primary arXiv category with description |
| `subjects` | string | All arXiv categories the paper belongs to |
| `doi` | string | DOI link if available |
| `abstract` | string | Paper abstract |
| `file_path` | string | Path to PDF in the source dataset |
- Note that the ZIP files referenced by `file_path` are hosted in [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers), not in this dataset.
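Because `submission_date` is stored as a string, sorting or filtering chronologically requires parsing it first. A minimal sketch, assuming every value follows the `18 Feb 2009` pattern shown in the table (values in any other format would raise `ValueError`); the helper name `parse_submission_date` is hypothetical:

```python
from datetime import date, datetime


def parse_submission_date(raw: str) -> date:
    """Parse a date such as '18 Feb 2009' (day, abbreviated month, year)."""
    # %b matches English month abbreviations under the default C locale;
    # other locales may expect different abbreviations.
    return datetime.strptime(raw.strip(), "%d %b %Y").date()


print(parse_submission_date("18 Feb 2009"))
# 2009-02-18
```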
## Usage
### Loading Specific Subjects and Time Periods
```python
import polars as pl
from huggingface_hub import hf_hub_download

# Download a single subject/year/month partition
local_path = hf_hub_download(
    repo_id="permutans/arxiv-papers-by-subject",
    repo_type="dataset",
    filename="data/cs.LG/2024/06/00000000.parquet",
)

# Read the downloaded parquet file into a DataFrame
df = pl.read_parquet(local_path)
```
### Loading Multiple Files with Glob Patterns
```python
from huggingface_hub import snapshot_download

# Download all cs.LG papers from 2024
snapshot_download(
    repo_id="permutans/arxiv-papers-by-subject",
    repo_type="dataset",
    allow_patterns="data/cs.LG/2024/*/*.parquet",
    local_dir="./arxiv_data",
)
```
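`allow_patterns` also accepts a list of globs, so several subjects and years can be fetched in one call. The sketch below builds such a list; `build_allow_patterns` and `download_subset` are hypothetical helper names, and the subject codes are only examples taken from the directory tree above.

```python
from typing import Iterable, List


def build_allow_patterns(subjects: Iterable[str], years: Iterable[int]) -> List[str]:
    """Return one glob per (subject, year) pair, matching every month."""
    years = list(years)  # materialise so it can be reused for each subject
    return [f"data/{s}/{y}/*/*.parquet" for s in subjects for y in years]


def download_subset(subjects, years, local_dir="./arxiv_data"):
    """Download only the selected partitions (requires huggingface_hub)."""
    # Imported lazily so the pattern helper works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="permutans/arxiv-papers-by-subject",
        repo_type="dataset",
        allow_patterns=build_allow_patterns(subjects, years),
        local_dir=local_dir,
    )


patterns = build_allow_patterns(["cs.AI", "cs.LG"], range(2023, 2025))
print(patterns)
```

Calling `download_subset(["cs.AI", "cs.LG"], range(2023, 2025))` would then fetch only those two subjects for 2023 and 2024.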
### Using with Polars LazyFrames
```python
import polars as pl

# Scan multiple files lazily (nothing is read until .collect())
lf = pl.scan_parquet("arxiv_data/data/cs.*/2024/*/*.parquet")

# Filter and collect only what you need
recent_ml = lf.filter(
    pl.col("primary_subject").str.contains("Machine Learning")
).collect()
```
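The field list has no standalone columns for the partition's subject code, year, or month, so when several files are combined it can help to recover those values from each file's path. A minimal sketch, assuming forward-slash paths that end in `<subject>/<year>/<month>/<file>` as in the directory tree; `parse_partition` is a hypothetical helper name:

```python
from pathlib import PurePosixPath
from typing import Tuple


def parse_partition(path: str) -> Tuple[str, int, int]:
    """Extract (subject, year, month) from .../<subject>/<year>/<month>/<file>."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"path too short to contain a partition: {path!r}")
    # The last component is the file name; the three before it are the keys.
    subject, year, month = parts[-4], parts[-3], parts[-2]
    return subject, int(year), int(month)


print(parse_partition("arxiv_data/data/cs.LG/2024/06/00000000.parquet"))
# ('cs.LG', 2024, 6)
```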
## Dataset Statistics
- **Total papers**: ~2.55 million
- **Subject categories**: 167
- **Year range**: 1989–2025
- **File format**: Parquet (compressed)
## Source Attribution
This dataset is derived from [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers), which provides the complete arXiv scientific papers archive. The original dataset contains both metadata and PDFs; this derived dataset includes only the metadata, reorganised for efficient partial access.
The underlying paper content originates from [arXiv.org](https://arxiv.org), operated by Cornell University.
## License
This dataset follows the licensing structure of the source:
- **Dataset packaging and organisation**: MIT License, as for [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers)
- **Individual paper content**: Subject to each paper's license as specified by arXiv and the respective authors
## Citation
If you use this dataset, please cite both this reorganised version and the original source:
```bibtex
@dataset{arxiv_papers_by_subject_2025,
title = {arXiv Papers by Subject},
author = {permutans},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/permutans/arxiv-papers-by-subject}
}
@dataset{arxiv_papers_2025,
title = {arXiv Papers Dataset},
author = {nick007x},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/nick007x/arxiv-papers}
}
```
Provider:
permutans



