Vidushee/ArXiv-Papers-150K

Hugging Face · Updated 2026-03-24 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/Vidushee/ArXiv-Papers-150K
Resource description:
---
license: cc-by-nc-4.0
task_categories:
- text-generation
- document-question-answering
language:
- en
pretty_name: ArXiv Papers 150K
size_categories:
- 100K<n<1M
tags:
- arxiv
- latex
- scientific-papers
- machine-learning
- computer-vision
- nlp
- research-papers
- llm
- rl
- latex-source
- ocr
- arxiv-ml-papers
viewer: false
---

# ArXiv-Papers-150K

150K+ ArXiv papers as raw LaTeX source archives, covering major AI/ML conferences (2016--2026).

## Overview

| | |
|---|---|
| **Papers** | 150,334 |
| **Size** | ~285 GB |
| **Format** | `.tar.gz` per paper (original ArXiv source) |
| **Years** | 2016 -- 2026 |
| **Categories** | cs.LG, cs.CV, cs.CL, cs.AI, stat.ML, cs.NE, cs.SD, eess.AS, cs.RO |

## Category Breakdown

| Category | Papers | Description |
|---|---|---|
| cs.LG | 54,200 | Machine Learning (ICML, NeurIPS, ICLR) |
| cs.CV | 35,000 | Computer Vision (CVPR, ECCV, ICCV) |
| cs.CL | 25,000 | Computation & Language (ACL, EMNLP) |
| cs.AI | 20,000 | Artificial Intelligence (AAAI, IJCAI) |
| stat.ML | 7,803 | Statistical Machine Learning |
| cs.NE | 2,500 | Neural & Evolutionary Computing (IJCNN) |
| cs.RO | 2,500 | Robotics |
| cs.SD | 1,500 | Sound |
| eess.AS | 1,500 | Audio & Speech |

## Dataset Structure

```
repo/
  metadata.parquet          # Paper metadata (title, abstract, authors, categories, year, chunk_file)
  cs.LG/
    cs.LG_part_000.tar      # ~10GB each, contains individual paper .tar.gz files
    cs.LG_part_001.tar
    ...
  cs.CV/
    cs.CV_part_000.tar
    ...
```

Each `.tar` chunk contains individual paper archives (`.tar.gz`). Each paper archive contains the LaTeX source as submitted to ArXiv: `.tex` files, figures, `.bib` references, style files, etc.
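
**Read `.tex` files from a chunk without unpacking it (sketch):**

Because each chunk is a `.tar` whose members are themselves `.tar.gz` paper archives, the LaTeX can in principle be read by opening the inner archives in memory instead of extracting everything to disk. The sketch below is one way to do that with the standard library only. The function name `iter_tex_files` is hypothetical, and skipping members that fail to open as gzip-compressed tar files is a defensive assumption, not something the dataset card specifies.

```python
import io
import tarfile


def iter_tex_files(chunk_path):
    """Yield (paper_archive_name, tex_member_name, text) for each .tex file in a chunk.

    chunk_path is a local path to one uncompressed .tar chunk.
    """
    with tarfile.open(chunk_path, mode="r:") as chunk:
        for member in chunk:
            # Only regular files that look like per-paper archives are considered.
            if not (member.isfile() and member.name.endswith(".tar.gz")):
                continue
            inner_bytes = chunk.extractfile(member).read()
            try:
                with tarfile.open(fileobj=io.BytesIO(inner_bytes), mode="r:gz") as paper:
                    for item in paper:
                        if item.isfile() and item.name.endswith(".tex"):
                            raw = paper.extractfile(item).read()
                            # Older sources are not always UTF-8, so decode leniently.
                            yield member.name, item.name, raw.decode("utf-8", errors="replace")
            except tarfile.ReadError:
                # Assumption: members that are not readable gzip tars are skipped.
                continue
```

A typical call would be `for paper, name, text in iter_tex_files(path): ...`, where `path` is the local location of a downloaded chunk. Each inner archive is held in memory while it is read, which seems reasonable for single papers but is a design choice worth revisiting for unusually large submissions.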

## Metadata

`metadata.parquet` contains the following columns for all 150K papers:

| Column | Type | Description |
|---|---|---|
| `paper_id` | string | ArXiv ID (e.g., `2401.12345`) |
| `title` | string | Paper title |
| `abstract` | string | Abstract |
| `authors` | string | Comma-separated author names |
| `categories` | string | All ArXiv categories |
| `primary_category` | string | Primary ArXiv category |
| `published` | string | Publication date |
| `year` | int | Year extracted from ArXiv ID |
| `chunk_file` | string | Which tar chunk contains this paper |
| `doi` | string | DOI if available |
| `journal_ref` | string | Journal reference if available |

## Quick Start

**Load metadata:**

```python
import pandas as pd

df = pd.read_parquet("hf://datasets/Vidushee/ArXiv-Papers-150K/metadata.parquet")
print(df.shape)  # (150334, 12)
```

**Download and extract a category:**

```python
from huggingface_hub import hf_hub_download
import tarfile, os

# Download one chunk
path = hf_hub_download(
    repo_id="Vidushee/ArXiv-Papers-150K",
    filename="cs.LG/cs.LG_part_000.tar",
    repo_type="dataset",
)

# Extract paper archives
with tarfile.open(path) as tar:
    tar.extractall("./cs_LG_papers/")  # Each file is a paper's source archive

# Extract a single paper
paper = "./cs_LG_papers/2401.12345.tar.gz"
with tarfile.open(paper) as t:
    t.extractall("./paper_source/")  # Now you have: main.tex, figures/, references.bib, etc.
```

## Use Cases

- LLM pre-training on scientific text
- Scientific document understanding
- LaTeX generation and completion
- Citation network analysis
- Figure extraction and captioning

## Source

All papers sourced from [arxiv.org](https://arxiv.org) via the `/e-print/` endpoint. Metadata from [anonymousatom/arxiv-metadata](https://huggingface.co/datasets/anonymousatom/arxiv-metadata). Papers distributed under their original ArXiv licenses.
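
**Find the chunks for a subset from metadata (sketch):**

Since chunks are roughly 10 GB each, it can help to consult the `chunk_file` column before downloading anything large. The sketch below filters on the documented `primary_category` and `year` columns and returns the distinct `chunk_file` values for the matching papers. The helper name `chunks_for` is hypothetical, and the demo rows (paper IDs and chunk names) are made-up placeholders; the exact form of real `chunk_file` values, for example whether they include the category directory, is not stated in the card and should be checked against the actual parquet file.

```python
import pandas as pd


def chunks_for(df, category, year):
    """Return the sorted, de-duplicated chunk_file values for one category and year."""
    mask = (df["primary_category"] == category) & (df["year"] == year)
    return sorted(set(df.loc[mask, "chunk_file"]))


# Synthetic stand-in for metadata.parquet; every value here is a placeholder.
demo = pd.DataFrame(
    {
        "paper_id": ["2401.00001", "2401.00002", "2401.00003", "2301.00004"],
        "primary_category": ["cs.LG", "cs.LG", "cs.CV", "cs.LG"],
        "year": [2024, 2024, 2024, 2023],
        "chunk_file": [
            "cs.LG_part_001.tar",
            "cs.LG_part_000.tar",
            "cs.CV_part_000.tar",
            "cs.LG_part_000.tar",
        ],
    }
)

print(chunks_for(demo, "cs.LG", 2024))
```

With the real metadata loaded into `df`, the returned names indicate which chunks would need to be passed to `hf_hub_download`, possibly after prefixing the category directory if the stored values omit it.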

## Citation

```bibtex
@dataset{arxiv_papers_150k,
  title={ArXiv-Papers-150K: LaTeX Source Archives for 150K AI/ML Papers},
  year={2026},
  url={https://huggingface.co/datasets/Vidushee/ArXiv-Papers-150K},
  note={150K ArXiv paper source archives covering cs.LG, cs.CV, cs.CL, cs.AI, stat.ML and related categories (2016-2026)}
}
```
Provider:
Vidushee