CortexEvolved/arxiv-tex-corpus-full

Name: CortexEvolved/arxiv-tex-corpus-full
Creator: CortexEvolved
Published: 2026-04-06 11:41:26
License: (no description)

Source: Hugging Face. Updated 2026-04-06. Indexed 2026-04-12.

Download link:

https://hf-mirror.com/datasets/CortexEvolved/arxiv-tex-corpus-full

Description:

---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- arxiv
- maths
- computer-science
- physics
size_categories:
- 100K<n<1M
---

<h1 align="center">arxiv-tex-corpus-full (80GB)</h1>

<p align="center">
Large-scale LaTeX corpus from arXiv (math, CS, physics, statistics)
</p>

📄 Paper: https://arxiv.org/abs/2602.17288

## 📚 Overview

**arxiv-tex-corpus-full (80GB)** is a large-scale dataset of LaTeX source content extracted from papers hosted on arXiv.

This version contains approximately **80GB of structured JSONL data**, restricted to the following arXiv categories:

* `math`
* `cs`
* `hep-th`
* `hep-ph`
* `quant-ph`
* `stat.ML`
* `stat.TH`

The dataset is designed for research in:

* Large Language Model (LLM) pretraining
* Mathematical reasoning
* Theoretical physics modeling
* Scientific document modeling
* LaTeX language modeling

This repository provides the **full 80GB version** of the corpus.

## 📦 Files

```
train.jsonl
val.jsonl
checkpoint.txt
```

### `train.jsonl`

Training split in JSON Lines format.

### `val.jsonl`

Validation split in JSON Lines format.

### `checkpoint.txt`

Internal processing checkpoint file used during dataset construction. This file is not required for training.

## 🧾 Data Format

Each line in `train.jsonl` and `val.jsonl` is a JSON object.

Example schema:

```json
{
  "paper_id": "xxxx.xxxxx",
  "category": "cs",
  "latex": "\\documentclass{article} ..."
}
```

### Fields

* `paper_id`: arXiv identifier
* `category`: one of:
  * `math`
  * `cs`
  * `hep-th`
  * `hep-ph`
  * `quant-ph`
  * `stat.ML`
  * `stat.TH`
* `latex`: extracted LaTeX source content

## 📊 Dataset Statistics

* Total size: ~80GB
* Format: JSONL
* Categories: mathematics, computer science, theoretical physics, quantum physics, and statistics
* Snapshot date: *(add date here)*
* Deduplication: *(state yes/no)*

## 🧹 Processing Pipeline

The dataset was constructed by:

1. Downloading arXiv source archives.
2. Extracting LaTeX source files from eligible papers.
3. Filtering papers to retain only the specified categories.
4. Converting documents into structured JSONL format.
5. Splitting into training and validation sets.

Unless otherwise specified:

* No semantic cleaning was applied.
* No normalization of LaTeX commands was performed.
* No compilation validation was enforced.

## ⚖️ Licensing

This dataset contains LaTeX source files from papers hosted on arXiv.

Each paper retains its original license as specified by its authors on arXiv. Users are responsible for complying with the licensing terms of individual papers.

For license details, see: [https://arxiv.org/help/license](https://arxiv.org/help/license)

If you filtered by specific licenses (e.g., CC-BY only), please state that here explicitly.

## 🎯 Intended Use

This dataset is intended for:

* Pretraining or continued training of language models
* Research in mathematical and scientific reasoning
* Modeling structured scientific documents
* Studying LaTeX generation and transformation

## 🚫 Limitations

* LaTeX sources may contain compilation errors.
* Some source bundles may be incomplete.
* Bibliography files may be missing in some cases.
* Licensing varies across papers.
* Category labeling follows arXiv metadata and may not reflect full topical scope.

## 📌 Citation

If you use this dataset, please cite:

1. The original arXiv papers
2. This repository

Example:

```bibtex
@dataset{arxiv_latex_corpus_2026,
  title = {arxiv-latex-corpus (80GB)},
  year = {2026},
  publisher = {Hugging Face},
}
```

## 🤝 Acknowledgements

This dataset is derived from papers made publicly available by authors via arXiv. We thank the research community for openly sharing their work.
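## Example: reading the JSONL splits

The record format described under Data Format can be read one line at a time, which matters for files of this size. The following is a minimal sketch, not part of the dataset's own tooling. It assumes the split files have already been downloaded locally and that every record carries the three fields shown in the example schema. The function names (`iter_records`, `iter_category`, `count_categories`) are hypothetical and exist only for illustration.

```python
import json
from collections import Counter
from typing import Iterator

# Category labels as listed in the dataset card (assumed to be exhaustive).
CATEGORIES = {"math", "cs", "hep-th", "hep-ph", "quant-ph", "stat.ML", "stat.TH"}
# Field names taken from the example schema in the dataset card.
REQUIRED_FIELDS = ("paper_id", "category", "latex")


def iter_records(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = [name for name in REQUIRED_FIELDS if name not in record]
            if missing:
                raise ValueError(f"line {line_number}: missing fields {missing}")
            yield record


def iter_category(path: str, category: str) -> Iterator[dict]:
    """Return an iterator over records whose `category` equals the given label."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return (rec for rec in iter_records(path) if rec["category"] == category)


def count_categories(path: str) -> Counter:
    """Tally how many records carry each category label."""
    return Counter(record["category"] for record in iter_records(path))
```

For example, `count_categories("val.jsonl")` would give a per-category tally of the validation split, and `iter_category("train.jsonl", "hep-th")` would stream only the `hep-th` records. Streaming keeps memory use roughly proportional to the largest single record rather than to the full file.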

Provided by:

CortexEvolved
