
CortexEvolved/arxiv-tex-corpus-full

Hugging Face | Updated 2026-04-06 | Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/CortexEvolved/arxiv-tex-corpus-full
Resource description:
---
license: mit
task_categories:
- text-generation
language:
- en
tags:
- arxiv
- maths
- computer-science
- physics
size_categories:
- 100K<n<1M
---

<h1 align="center">arxiv-tex-corpus-full (80GB)</h1>

<p align="center">
Large-scale LaTeX corpus from arXiv (math, CS, physics, statistics)
</p>

📄 Paper: https://arxiv.org/abs/2602.17288

## 📚 Overview

**arxiv-tex-corpus-full (80GB)** is a large-scale dataset of LaTeX source content extracted from papers hosted on arXiv.

This version contains approximately **80GB of structured JSONL data**, restricted to the following arXiv categories:

* `math`
* `cs`
* `hep-th`
* `hep-ph`
* `quant-ph`
* `stat.ML`
* `stat.TH`

The dataset is designed for research in:

* Large Language Model (LLM) pretraining
* Mathematical reasoning
* Theoretical physics modeling
* Scientific document modeling
* LaTeX language modeling

This repository provides the **full 80GB version** of the corpus.

## 📦 Files

```
train.jsonl
val.jsonl
checkpoint.txt
```

### `train.jsonl`

Training split in JSON Lines format.

### `val.jsonl`

Validation split in JSON Lines format.

### `checkpoint.txt`

Internal processing checkpoint file used during dataset construction. This file is not required for training.

## 🧾 Data Format

Each line in `train.jsonl` and `val.jsonl` is a JSON object.

Example schema:

```json
{
  "paper_id": "xxxx.xxxxx",
  "category": "cs",
  "latex": "\\documentclass{article} ..."
}
```

### Fields

* `paper_id`: arXiv identifier
* `category`: one of
  * `math`
  * `cs`
  * `hep-th`
  * `hep-ph`
  * `quant-ph`
  * `stat.ML`
  * `stat.TH`
* `latex`: extracted LaTeX source content

## 📊 Dataset Statistics

* Total size: ~80GB
* Format: JSONL
* Categories: mathematics, computer science, theoretical physics, quantum physics, and statistics
* Snapshot date: *(add date here)*
* Deduplication: *(state yes/no)*

## 🧹 Processing Pipeline

The dataset was constructed by:

1. Downloading arXiv source archives.
2. Extracting LaTeX source files from eligible papers.
3. Filtering papers to retain only the specified categories.
4. Converting documents into structured JSONL format.
5. Splitting into training and validation sets.

Unless otherwise specified:

* No semantic cleaning was applied.
* No normalization of LaTeX commands was performed.
* No compilation validation was enforced.

## ⚖️ Licensing

This dataset contains LaTeX source files from papers hosted on arXiv.

Each paper retains its original license as specified by its authors on arXiv. Users are responsible for complying with the licensing terms of individual papers.

For license details, see: [https://arxiv.org/help/license](https://arxiv.org/help/license)

If you filtered by specific licenses (e.g., CC-BY only), please state that here explicitly.

## 🎯 Intended Use

This dataset is intended for:

* Pretraining or continued training of language models
* Research in mathematical and scientific reasoning
* Modeling structured scientific documents
* Studying LaTeX generation and transformation

## 🚫 Limitations

* LaTeX sources may contain compilation errors.
* Some source bundles may be incomplete.
* Bibliography files may be missing in some cases.
* Licensing varies across papers.
* Category labeling follows arXiv metadata and may not reflect full topical scope.

## 📌 Citation

If you use this dataset, please cite:

1. The original arXiv papers
2. This repository

Example:

```bibtex
@dataset{arxiv_latex_corpus_2026,
  title = {arxiv-latex-corpus (80GB)},
  year = {2026},
  publisher = {Hugging Face},
}
```

## 🤝 Acknowledgements

This dataset is derived from papers made publicly available by authors via arXiv. We thank the research community for openly sharing their work.
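
As an illustration of the data format described in this card (one JSON object per line with `paper_id`, `category`, and `latex`), the following is a minimal sketch of how the splits might be read line by line without loading an 80GB file into memory. The helper names `iter_records` and `category_counts` are hypothetical and not part of the dataset; the sample records use made-up identifiers, and skipping malformed lines is an assumption about how a consumer may want to behave, not something the card specifies.

```python
import json
from collections import Counter

# Category labels as listed in the dataset card.
ALLOWED_CATEGORIES = {"math", "cs", "hep-th", "hep-ph", "quant-ph", "stat.ML", "stat.TH"}
REQUIRED_FIELDS = ("paper_id", "category", "latex")


def iter_records(lines):
    """Yield records that match the card's schema (hypothetical helper).

    `lines` is any iterable of strings, e.g. an open file handle,
    so the file is streamed rather than read in full.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: skip malformed lines instead of aborting
        if not isinstance(record, dict):
            continue
        has_fields = all(isinstance(record.get(key), str) for key in REQUIRED_FIELDS)
        if has_fields and record["category"] in ALLOWED_CATEGORIES:
            yield record


def category_counts(lines):
    """Count valid records per category (hypothetical helper)."""
    return Counter(record["category"] for record in iter_records(lines))


# In-memory sample with made-up identifiers; two valid and two invalid lines.
sample_lines = [
    json.dumps({"paper_id": "0000.00001", "category": "cs", "latex": "\\documentclass{article}"}),
    json.dumps({"paper_id": "0000.00002", "category": "math", "latex": "\\section{Intro}"}),
    "not json",
    json.dumps({"paper_id": "0000.00003", "category": "bio"}),
]
counts = category_counts(sample_lines)
```

With a local copy of the data, the same function could be applied to a file handle, for example `with open("train.jsonl", encoding="utf-8") as f: category_counts(f)`. Because the card notes that no cleaning or compilation validation was applied, a check of this kind only confirms the record shape, not the quality of the LaTeX itself.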
Provided by:
CortexEvolved