permutans/arxiv-papers-by-subject

Name: permutans/arxiv-papers-by-subject
Creator: permutans
Published: 2025-12-21 14:22:17
License: No description available

Hugging Face · Updated 2025-12-21 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/permutans/arxiv-papers-by-subject


Resource description:

---
license: mit
task_categories:
- text-generation
- feature-extraction
language:
- en
tags:
- arxiv
- academic-papers
- scientific-literature
- research
- metadata
size_categories:
- 1M<n<10M
source_datasets:
- nick007x/arxiv-papers
configs:
- config_name: default
  data_files:
  - split: train
    path: "data/**/*.parquet"
---

# arXiv Papers by Subject

A reorganised version of the [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers) dataset, partitioned by subject code, year, and month for efficient selective access.

## Dataset Description

This dataset contains metadata for over 2.5 million arXiv papers, organised into a hierarchical directory structure that allows users to download only the specific subjects and time periods they need, rather than the entire dataset.

### Motivation

The original [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers) dataset is an excellent resource containing comprehensive arXiv paper metadata. However, its monolithic structure requires downloading the entire dataset even when only a subset of papers is needed.

This derived dataset addresses that limitation by partitioning the data into small, focused parquet files organised by:

1. **Subject code** (e.g., `cs.AI`, `astro-ph.CO`, `math.NA`)
2. **Year** (1989–2025)
3. **Month** (01–12)

This structure enables:

- Downloading only specific research domains
- Fetching data for particular time ranges
- Incremental updates as new papers are published
- Efficient caching and lazy loading

## Dataset Structure

```
data/
├── astro-ph.CO/
│   ├── 2009/
│   │   ├── 01/
│   │   │   └── 00000000.parquet
│   │   ├── 02/
│   │   │   └── 00000000.parquet
│   │   └── ...
│   └── ...
├── cs.AI/
│   ├── 1993/
│   │   └── ...
│   └── 2025/
│       └── ...
├── cs.LG/
│   └── ...
└── ...
```

### Subject Categories

The dataset includes 148 arXiv subject categories spanning:

| Domain | Example Categories |
|--------|-------------------|
| Astrophysics | `astro-ph.*` × 6 |
| Condensed Matter | `cond-mat.*` × 9 |
| Computer Science | `cs.*` × 60 |
| Economics | `econ.*` × 3 |
| Electrical Engineering | `eess.*` × 4 |
| Mathematics | `math.*` × 30 |
| Physics | `gr-qc`, `hep-*` × 4, `nucl-*` × 2, `quant-ph`, `physics.*` × 22 |
| Quantitative Biology | `q-bio.*` × 10 |
| Quantitative Finance | `q-fin.*` × 8 |
| Statistics | `stat.*` × 5 |
| Nonlinear Sciences | `nlin.*` × 5 |

### Data Fields

Each parquet file contains the following fields (inherited from the source dataset):

| Field | Type | Description |
|-------|------|-------------|
| `arxiv_id` | string | Unique arXiv identifier (e.g., `2301.00001`) |
| `title` | string | Paper title |
| `authors` | list[string] | List of author names |
| `submission_date` | string | Date of submission (e.g., `18 Feb 2009`) |
| `comments` | string | Author comments (page count, figures, etc.) |
| `primary_subject` | string | Primary arXiv category with description |
| `subjects` | string | All arXiv categories the paper belongs to |
| `doi` | string | DOI link if available |
| `abstract` | string | Paper abstract |
| `file_path` | string | Path to PDF in the source dataset |

- Note that the ZIP files in `file_path` point to [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers)!
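Because `submission_date` is stored as a string rather than a date type, it usually needs parsing before any date arithmetic. The sketch below parses the example format from the table above (`18 Feb 2009`) and derives the partition directory a record would fall under. Two assumptions are made here and are not confirmed by the dataset card: that every value follows the same day/abbreviated-month/year format, and that the year and month directories are derived from `submission_date`. The helper names `parse_submission_date` and `partition_dir` are hypothetical.

```python
from datetime import date, datetime


def parse_submission_date(raw: str) -> date:
    """Parse a `submission_date` string such as '18 Feb 2009'.

    Assumption: all values use this format. `%b` matches abbreviated
    month names in the current locale, so an English (or C) locale is
    assumed as well.
    """
    return datetime.strptime(raw.strip(), "%d %b %Y").date()


def partition_dir(subject_code: str, submitted: date) -> str:
    """Build the data/<subject>/<year>/<month> directory for a record.

    Assumption: partitions are keyed on the submission date; the card
    only states that data is split by subject code, year, and month.
    """
    return f"data/{subject_code}/{submitted.year:04d}/{submitted.month:02d}"


submitted = parse_submission_date("18 Feb 2009")
print(submitted.isoformat())                    # 2009-02-18
print(partition_dir("astro-ph.CO", submitted))  # data/astro-ph.CO/2009/02
```

If some rows use a different date format, `strptime` raises `ValueError`, so it may be worth wrapping the call and inspecting the failures before relying on the parsed column.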
## Usage

### Loading Specific Subjects and Time Periods

```python
import polars as pl
from huggingface_hub import hf_hub_download

# Download a specific subject/year/month
local_path = hf_hub_download(
    repo_id="permutans/arxiv-papers-by-subject",
    repo_type="dataset",
    filename="data/cs.LG/2024/06/00000000.parquet"
)

# Read the downloaded partition into a DataFrame
df = pl.read_parquet(local_path)
```

### Loading Multiple Files with Glob Patterns

```python
from huggingface_hub import snapshot_download

# Download all cs.LG papers from 2024
snapshot_download(
    repo_id="permutans/arxiv-papers-by-subject",
    repo_type="dataset",
    allow_patterns="data/cs.LG/2024/*/*.parquet",
    local_dir="./arxiv_data"
)
```

### Using with Polars LazyFrames

```python
import polars as pl

# Scan multiple files lazily
lf = pl.scan_parquet("arxiv_data/data/cs.*/2024/*/*.parquet")

# Filter and collect only what you need
recent_ml = lf.filter(
    pl.col("primary_subject").str.contains("Machine Learning")
).collect()
```

## Dataset Statistics

- **Total papers**: ~2.55 million
- **Subject categories**: 167
- **Year range**: 1998–2025
- **File format**: Parquet (compressed)

## Source Attribution

This dataset is derived from [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers), which provides the complete arXiv scientific papers archive. The original dataset contains both metadata and PDFs; this derived dataset includes only the metadata, reorganised for efficient partial access.

The underlying paper content originates from [arXiv.org](https://arxiv.org), operated by Cornell University.
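Returning to the glob-pattern download shown under Usage: `snapshot_download` also accepts a list of patterns, so several subjects and years can be fetched in one call. The following is a minimal sketch, not part of the original card. The helper names `build_allow_patterns` and `download_subset` are hypothetical, and the subject and year values are only examples (whether a given subject/year partition exists in the repository is not checked). The offline check uses Python's `fnmatch` as a stand-in for glob-style matching.

```python
from fnmatch import fnmatch
from typing import Iterable, List


def build_allow_patterns(subjects: Iterable[str], years: Iterable[int]) -> List[str]:
    """Return one glob pattern per (subject, year) pair, covering all months."""
    years = list(years)  # allow generators/ranges to be reused per subject
    return [
        f"data/{subject}/{year}/*/*.parquet"
        for subject in subjects
        for year in years
    ]


def download_subset(subjects: Iterable[str], years: Iterable[int],
                    local_dir: str = "./arxiv_data") -> str:
    """Download only the matching partitions (requires network access)."""
    # Imported lazily so the pattern helper works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="permutans/arxiv-papers-by-subject",
        repo_type="dataset",
        allow_patterns=build_allow_patterns(subjects, years),
        local_dir=local_dir,
    )


# Example values only: two subjects from the directory tree, two years.
patterns = build_allow_patterns(["cs.AI", "cs.LG"], range(2023, 2025))
print(patterns)

# Offline sanity check that a known path from the card matches a pattern.
sample = "data/cs.LG/2024/06/00000000.parquet"
print(any(fnmatch(sample, p) for p in patterns))
```

Calling `download_subset(["cs.AI", "cs.LG"], range(2023, 2025))` would then place the files under `./arxiv_data`, where the `pl.scan_parquet` glob from the LazyFrames example can pick them up.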
## License

This dataset follows the licensing structure of the source:

- **Dataset packaging and organisation**: MIT License, as for [nick007x/arxiv-papers](https://huggingface.co/datasets/nick007x/arxiv-papers)
- **Individual paper content**: Subject to each paper's license as specified by arXiv and the respective authors

## Citation

If you use this dataset, please cite both this reorganised version and the original source:

```bibtex
@dataset{arxiv_papers_by_subject_2025,
  title = {arXiv Papers by Subject},
  author = {permutans},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/permutans/arxiv-papers-by-subject}
}

@dataset{arxiv_papers_2025,
  title = {arXiv Papers Dataset},
  author = {nick007x},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/nick007x/arxiv-papers}
}
```

