arxiv-papers

Name: arxiv-papers
Creator: maas
Published: 2025-12-05 16:54:55
License: No description available

ModelScope community: updated 2025-12-05, indexed 2025-11-03

Download link:

https://modelscope.cn/datasets/nick007x/arxiv-papers



Resource overview:

---
license: mit
language:
- en
size_categories:
- 1T<n<10T
task_categories:
- text-to-image
- visual-question-answering
- document-question-answering
- text-generation
---

# Complete ArXiv Papers Dataset (4.68 TB)

## 📚 Dataset Overview

This repository contains **the complete arXiv scientific papers archive**, organized by subject category and publication year. With 4.68 TB of compressed PDFs and metadata, it is one of the largest collections of scientific literature available for research and AI training.

## 🗂️ Dataset Structure

### Organized by Subject Categories:

- **astro-ph** (00-22): Astrophysics
- **cond-mat** (00-32): Condensed Matter Physics
- **cs** (00-62): Computer Science (most extensive category)
- **math** (00-52): Mathematics
- **physics** (00-16): General Physics
- **quant-ph** (00-12): Quantum Physics
- **stat** (00-05): Statistics
- **econ, eess, hep, nlin, q-bio, q-fin**: Specialized categories
- Plus additional specialized domains

### File Organization:

- Each category is split into numbered segments (00, 01, 02...)
- Large categories are further divided into parts (part-1, part-2, etc.)
- All files are ZIP archives containing PDFs

## 📊 Dataset Statistics

- **Total Size**: 4.68 TB (compressed)
- **Format**: ZIP archives containing PDFs + metadata
- **Coverage**: Complete arXiv historical archive
- **Organization**: By subject category

## 🎯 Primary Use Cases

### Multi-Modal AI Training

- **Scientific Document Understanding**: Train models on full PDF content
- **Figure-Caption Alignment**: Extract and pair scientific figures with their descriptions
- **Mathematical Reasoning**: Process complex mathematical notation and derivations
- **Cross-modal Retrieval**: Link textual concepts with visual scientific content

### Research Applications

- **Bibliometric Analysis**: Track research trends across decades
- **Scientific NLP**: Train domain-specific language models
- **Knowledge Extraction**: Parse algorithms, methodologies, and results
- **Academic Search**: Build enhanced scientific search engines

## 🛠️ Usage Examples

### Accessing Specific Categories

```python
# Example: download segment 00 of the Computer Science papers
import zipfile

from huggingface_hub import hf_hub_download

file_path = hf_hub_download(
    repo_id="nick007x/arxiv-papers",
    filename="cs-00.zip",
    repo_type="dataset",  # required: the repository is a dataset, not a model
)

# Extract the PDFs into a local directory
with zipfile.ZipFile(file_path, "r") as zip_ref:
    zip_ref.extractall("cs_papers/")
```

### Working with Metadata

The `train.parquet` file contains structured metadata including:

- arXiv IDs, titles, authors, submission dates
- Abstracts, comments, primary subjects
- File paths to corresponding PDFs

## ⚡ Quick Start

1. **Browse Categories**: Start with smaller categories like `gr-qc-00.zip` (4.07 GB)
2. **Extract Metadata**: Use `train.parquet` for paper discovery
3. **Targeted Download**: Download specific subject areas of interest
4. **Stream Processing**: Handle large files with streaming extraction

## 🌟 Value Proposition

This dataset enables:

- **Complete Scientific Coverage**: Every paper from arXiv's history
- **Multi-Domain Expertise**: Physics, CS, Math, Statistics, and more
- **Ready for Foundation Models**: Suited to training scientific AI
- **Structured Organization**: Easy access by domain and time period

## 📜 License & Usage Terms

**Important:** This dataset is a **collection** of individual scholarly works from arXiv.org. The licensing structure is as follows:

* **The Collection (Metadata & Packaging):** The script used to create this dataset, the unique metadata (e.g., file structure, dataset description), and its packaging are licensed under the **MIT License**.
* **The Individual Papers (Content):** Each paper (PDF/TeX source) remains under the copyright and license chosen by its respective author(s). These licenses are typically Creative Commons (e.g., CC BY, CC BY-NC, CC BY-NC-ND).
* **Your Responsibility:** Users of this dataset are **solely responsible** for checking, understanding, and complying with the specific license terms of any paper they access, download, or use from this collection. You must provide appropriate attribution to the original authors as required by their chosen license.

**By using this dataset, you agree to these terms and acknowledge that the dataset creator is not liable for any license violations resulting from your use of the contained papers.**

For more information, see:

* [arXiv's Terms of Use](https://info.arxiv.org/help/license/index.html)
* [Creative Commons Licenses](https://creativecommons.org/licenses/)

## 🙏 Acknowledgments

This dataset builds upon the incredible work of:

- **arXiv** team and moderators
- **Paper authors** across all scientific domains
- **Open scientific community** enabling knowledge sharing

---

**Note**: Due to the massive size (4.68 TB), consider downloading specific categories of interest rather than the entire dataset. The organized structure makes targeted access straightforward.
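Quick Start step 4 recommends streaming extraction for large files. The following is a minimal sketch of one way to do that with the standard-library `zipfile` module, reading one PDF at a time instead of calling `extractall`. It assumes the papers are stored as `.pdf` members inside each segment archive. `iter_pdfs` is a hypothetical helper name, and the member names in the demo are placeholders, not real paths from the dataset.

```python
import io
import zipfile
from typing import BinaryIO, Iterator, Tuple, Union


def iter_pdfs(archive: Union[str, BinaryIO]) -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, pdf_bytes) pairs one at a time.

    Only one PDF is held in memory at once and nothing is written to disk.
    """
    with zipfile.ZipFile(archive, "r") as zf:
        for info in zf.infolist():
            # Skip directories and non-PDF members (assumed archive layout).
            if info.is_dir() or not info.filename.lower().endswith(".pdf"):
                continue
            with zf.open(info) as member:
                yield info.filename, member.read()


# Demo on a tiny in-memory archive standing in for a real segment file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("cs/0001.pdf", b"%PDF-1.4 first")    # placeholder member name
    zf.writestr("cs/0002.pdf", b"%PDF-1.4 second")   # placeholder member name
    zf.writestr("cs/README.txt", b"not a paper")     # skipped by the filter
buffer.seek(0)

names = [name for name, _ in iter_pdfs(buffer)]
print(names)  # ['cs/0001.pdf', 'cs/0002.pdf']
```

For a downloaded segment, the same helper should accept the local path returned by `hf_hub_download`, for example `for name, data in iter_pdfs(file_path): ...`. The archive itself still has to be downloaded in full. Only the extraction step is streamed.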


Provider:

maas

Created:

2025-10-17
