piushorn/arxiv-latex-tables-43k

Name: piushorn/arxiv-latex-tables-43k
Creator: piushorn
Published: 2026-03-20 13:07:38
License: CC BY-NC-SA 4.0 (as stated in the dataset card)

Hugging Face, updated 2026-03-20, indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/piushorn/arxiv-latex-tables-43k

Resource description:

---
language:
- en
license: cc-by-nc-sa-4.0
task_categories:
- table-question-answering
- document-question-answering
tags:
- latex
- tables
- arxiv
- scientific-documents
- table-extraction
- document-understanding
pretty_name: arXiv LaTeX Tables 43k
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files: "*.parquet"
---

# arXiv LaTeX Tables 43k

A curated collection of **~43k LaTeX tables** extracted from all arXiv papers published in December 2025, classified by structural complexity for training and evaluating table extraction models.

## Key Specifications

| Aspect | Details |
|--------|---------|
| **Size** | 43,651 tables |
| **Source** | All arXiv papers published in December 2025 |
| **Complexity Classes** | Simple (27,655), Moderate (10,647), Complex (5,349) |
| **Compilability** | All tables compile with `pdflatex` |
| **License Filter** | Only redistributable CC licenses |
| **Format** | Hugging Face Dataset (Parquet) |

## Dataset Structure

```python
{
    "table_id": str,      # Unique ID: "{arxiv_id}_table_{n}"
    "width_pt": float,    # Rendered table width in points
    "height_pt": float,   # Rendered table height in points
    "tabular": str,       # LaTeX tabular source code
    "complexity": str,    # "simple", "moderate", or "complex"
}
```

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("piushorn/arxiv-latex-tables-43k")
sample = ds['train'][0]
print(sample['tabular'])     # LaTeX source
print(sample['complexity'])  # e.g., "simple"
```

Efficient column-selective loading via DuckDB (fetches only needed columns from Parquet):

```python
import duckdb

parquet_url = (
    "https://huggingface.co/datasets/piushorn/arxiv-latex-tables-43k/"
    "resolve/main/data/train-00000-of-00001.parquet"
)
con = duckdb.connect()
rows = con.execute(f"SELECT tabular, complexity FROM read_parquet('{parquet_url}')").fetchall()
```

## Complexity Classification

Tables are classified into three structural complexity levels based on how well their layout maps to Markdown:

| Level | Criteria | Count |
|-------|----------|-------|
| **Simple** | Regular grid, no cell merging, standard lines | 27,655 |
| **Moderate** | Limited merging (header only or horizontal only), partial `\cline` | 10,647 |
| **Complex** | Multi-dimensional merging, nested structures, irregular layouts | 5,349 |

Classification was performed using GPT-5-mini based on structural analysis of the LaTeX source.

## Extraction Pipeline

1. **Source**: All arXiv papers published in December 2025 with redistributable Creative Commons licenses
2. **Extraction**: `tabular` and `tabular*` environments extracted from LaTeX source files
3. **Cleaning**: Citations (`\cite`, `\ref`, etc.) and captions removed; `\href` simplified to link text
4. **Validation**: Each table compiled with `pdflatex` using standard packages (booktabs, multirow, makecell, etc.)
5. **Measurement**: Page dimensions measured with `pdfinfo` to obtain width/height in points
6. **Classification**: Structural complexity classified by GPT-5-mini

### Compilation Template

```latex
\documentclass[varwidth]{standalone}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{booktabs,multirow,makecell,graphicx,array}
\usepackage{amsmath,amssymb}
\usepackage{colortbl}
\usepackage[table]{xcolor}
\usepackage{adjustbox,caption,diagbox}
\usepackage{pifont}
\begin{document}
% tabular environment inserted here
\end{document}
```

Tables that fail to compile with this template are excluded from the dataset.

## Use Cases

- **Table extraction evaluation**: Benchmark document parsers on LaTeX table recovery
- **Synthetic PDF generation**: Render tables into PDFs with known ground truth
- **Table structure recognition**: Train models to understand table layouts
- **Complexity-aware evaluation**: Evaluate models separately on simple vs. complex tables

## Citation

If you use this dataset in your research or project, please cite our paper:

```bibtex
@misc{horn2026benchmarking,
  title = {Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation},
  author = {Horn, Pius and Keuper, Janis},
  year = {2026},
  eprint = {2603.18652},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2603.18652}
}
```

📄 **Paper:** [arXiv:2603.18652](https://arxiv.org/abs/2603.18652)

### Acknowledgments

This work has been supported by the German Federal Ministry of Research, Technology and Space (BMFTR) in the program "Forschung an Fachhochschulen in Kooperation mit Unternehmen (FH-Kooperativ)" within the joint project **LLMpraxis** under grant 13FH622KX2.

<p align="center">
  <img src="logos/BMFTR_logo.png" alt="BMFTR_logo" width="150" />
  <img src="logos/HAW_logo.png" alt="HAW_logo" width="150" />
</p>

### Licensing Information

**Content License**: Individual tables retain their original arXiv paper licenses (CC BY 4.0, CC BY-SA 4.0, CC BY-NC-SA 4.0, or CC0 1.0).

**Dataset License**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

---

**Source**: arXiv papers, December 2025

**Dataset Created**: March 2026
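The validation and measurement steps of the extraction pipeline (wrap each table in the compilation template, compile with `pdflatex`, read the page size with `pdfinfo`) could be reproduced roughly as follows. This is a minimal sketch rather than the authors' actual pipeline code: the function names, the `<<TABULAR>>` placeholder, and the `pdflatex` flags are assumptions made for illustration, and both `pdflatex` and `pdfinfo` are assumed to be on `PATH`.

```python
import re
import subprocess
import tempfile
from pathlib import Path
from typing import Optional, Tuple

# Preamble copied from the card's compilation template.
# "<<TABULAR>>" is a hypothetical placeholder, not part of the original template.
TEMPLATE = r"""\documentclass[varwidth]{standalone}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{booktabs,multirow,makecell,graphicx,array}
\usepackage{amsmath,amssymb}
\usepackage{colortbl}
\usepackage[table]{xcolor}
\usepackage{adjustbox,caption,diagbox}
\usepackage{pifont}
\begin{document}
<<TABULAR>>
\end{document}
"""

# pdfinfo prints a line such as "Page size:      612 x 792 pts (letter)".
PAGE_SIZE_RE = re.compile(r"Page size:\s+([\d.]+) x ([\d.]+) pts")


def build_document(tabular: str) -> str:
    """Wrap a tabular environment in the standalone template."""
    return TEMPLATE.replace("<<TABULAR>>", tabular)


def parse_page_size(pdfinfo_output: str) -> Tuple[float, float]:
    """Return (width_pt, height_pt) parsed from pdfinfo's text output."""
    match = PAGE_SIZE_RE.search(pdfinfo_output)
    if match is None:
        raise ValueError("no 'Page size' line found in pdfinfo output")
    return float(match.group(1)), float(match.group(2))


def compile_and_measure(tabular: str) -> Optional[Tuple[float, float]]:
    """Compile one table; return its size in points, or None if it fails."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "table.tex").write_text(build_document(tabular), encoding="utf-8")
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "table.tex"],
            cwd=tmp,
            capture_output=True,
        )
        pdf = Path(tmp) / "table.pdf"
        if result.returncode != 0 or not pdf.exists():
            return None  # mirrors the rule that non-compiling tables are excluded
        info = subprocess.run(
            ["pdfinfo", str(pdf)], capture_output=True, text=True, check=True
        )
        return parse_page_size(info.stdout)
```

Calling `compile_and_measure(sample["tabular"])` on a dataset row should give values close to the stored `width_pt` and `height_pt`, although small differences are possible if the local TeX distribution differs from the one used to build the dataset.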

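The complexity-aware evaluation use case named in the dataset card amounts to reporting a metric separately for each value of the `complexity` field. The sketch below is one way to do that under stated assumptions: `scores` is a hypothetical mapping from `table_id` to a per-table score produced by whatever parser evaluation is in use, and the function name and example IDs are illustrative and do not come from the dataset.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def mean_score_by_complexity(
    rows: Iterable[Mapping[str, object]],
    scores: Mapping[str, float],
) -> Dict[str, float]:
    """Average per-table scores within each complexity class.

    rows   -- dataset records carrying "table_id" and "complexity" fields
    scores -- hypothetical per-table scores keyed by table_id
    Rows without a score are skipped.
    """
    buckets = defaultdict(list)
    for row in rows:
        table_id = row["table_id"]
        if table_id in scores:
            buckets[row["complexity"]].append(scores[table_id])
    return {level: mean(values) for level, values in buckets.items()}


# Illustrative records; these IDs are made up and only follow the documented pattern.
example_rows = [
    {"table_id": "0000.00001_table_0", "complexity": "simple"},
    {"table_id": "0000.00001_table_1", "complexity": "simple"},
    {"table_id": "0000.00002_table_0", "complexity": "complex"},
]
example_scores = {
    "0000.00001_table_0": 1.0,
    "0000.00001_table_1": 0.5,
    "0000.00002_table_0": 0.5,
}
print(mean_score_by_complexity(example_rows, example_scores))
```

Because a row loaded with `load_dataset` behaves like a dictionary, `ds["train"]` can be passed directly as `rows`.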

Provider:

piushorn
