piushorn/arxiv-latex-tables-43k
Hugging Face · Updated 2026-03-20 · Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/piushorn/arxiv-latex-tables-43k
Resource description:
---
language:
- en
license: cc-by-nc-sa-4.0
task_categories:
- table-question-answering
- document-question-answering
tags:
- latex
- tables
- arxiv
- scientific-documents
- table-extraction
- document-understanding
pretty_name: arXiv LaTeX Tables 43k
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files: "*.parquet"
---
# arXiv LaTeX Tables 43k
A curated collection of **~43k LaTeX tables** extracted from arXiv papers published in December 2025 (restricted to papers with redistributable Creative Commons licenses) and classified by structural complexity for training and evaluating table extraction models.
## Key Specifications
| Aspect | Details |
|--------|---------|
| **Size** | 43,651 tables |
| **Source** | All arXiv papers published in December 2025 |
| **Complexity Classes** | Simple (27,655), Moderate (10,647), Complex (5,349) |
| **Compilability** | All tables compile with `pdflatex` |
| **License Filter** | Only redistributable CC licenses |
| **Format** | Hugging Face Dataset (Parquet) |
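
The three class counts add up to the stated total, and the split is skewed toward simple tables. A quick arithmetic check, using only the numbers from the table above:

```python
# Class counts as listed in the Key Specifications table.
counts = {"simple": 27_655, "moderate": 10_647, "complex": 5_349}
total = sum(counts.values())

# Percentage share of each complexity class.
shares = {name: 100 * n / total for name, n in counts.items()}

print(f"total: {total:,}")
for name, pct in shares.items():
    print(f"{name:<9}{counts[name]:>7,}  {pct:5.1f}%")
```

Simple tables make up a bit under two thirds of the data, so unweighted averages over the whole dataset will mostly reflect performance on simple tables.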
## Dataset Structure
```python
{
    "table_id": str,     # Unique ID: "{arxiv_id}_table_{n}"
    "width_pt": float,   # Rendered table width in points
    "height_pt": float,  # Rendered table height in points
    "tabular": str,      # LaTeX tabular source code
    "complexity": str,   # "simple", "moderate", or "complex"
}
```
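Because `table_id` follows the pattern `"{arxiv_id}_table_{n}"`, the paper ID and table index can be recovered from it. The helper below is a minimal sketch: `split_table_id` is a hypothetical name, and it assumes `n` is a plain decimal integer, which the card does not state explicitly.

```python
def split_table_id(table_id: str) -> tuple[str, int]:
    """Split "{arxiv_id}_table_{n}" into (arxiv_id, n)."""
    # rpartition splits on the last "_table_" in the string.
    arxiv_id, sep, index = table_id.rpartition("_table_")
    if not sep or not index.isdigit():
        raise ValueError(f"unexpected table_id format: {table_id!r}")
    return arxiv_id, int(index)


# Made-up ID for illustration; real IDs come from the dataset.
print(split_table_id("2512.01234_table_3"))
```

Grouping records by the returned `arxiv_id` is one way to keep all tables from the same paper in the same split.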
## Quick Start
```python
from datasets import load_dataset
ds = load_dataset("piushorn/arxiv-latex-tables-43k")
sample = ds['train'][0]
print(sample['tabular']) # LaTeX source
print(sample['complexity']) # e.g., "simple"
```
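A common next step is selecting a subset by complexity class or rendered size. The sketch below works on any iterable of records with the fields listed under Dataset Structure; `select_tables` and the toy records are illustrative, and the 345 pt width cap is an arbitrary example threshold.

```python
def select_tables(records, complexity, max_width_pt=None):
    """Yield records of one complexity class, optionally capped by rendered width."""
    for rec in records:
        if rec["complexity"] != complexity:
            continue
        if max_width_pt is not None and rec["width_pt"] > max_width_pt:
            continue
        yield rec


# Toy records using the dataset's field names; all values are invented.
toy = [
    {"table_id": "a_table_0", "width_pt": 210.0, "complexity": "simple"},
    {"table_id": "b_table_0", "width_pt": 480.0, "complexity": "simple"},
    {"table_id": "c_table_0", "width_pt": 300.0, "complexity": "complex"},
]

narrow_simple = list(select_tables(toy, "simple", max_width_pt=345.0))
print([rec["table_id"] for rec in narrow_simple])
```

With the real data, `ds['train']` can be passed in place of `toy`; `ds['train'].filter(lambda r: r['complexity'] == 'complex')` achieves the class filter directly in `datasets`.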
Efficient column-selective loading via DuckDB (fetches only needed columns from Parquet):
```python
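import duckdb

# Direct URL of the Parquet file on the Hugging Face Hub.
parquet_url = (
    "https://huggingface.co/datasets/piushorn/arxiv-latex-tables-43k/"
    "resolve/main/data/train-00000-of-00001.parquet"
)

con = duckdb.connect()  # in-memory database
# Only the two selected columns are read from the remote file.
rows = con.execute(
    f"SELECT tabular, complexity FROM read_parquet('{parquet_url}')"
).fetchall()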
```
## Complexity Classification
Tables are classified into three structural complexity levels based on how well their layout maps to Markdown:
| Level | Criteria | Count |
|-------|----------|-------|
| **Simple** | Regular grid, no cell merging, standard lines | 27,655 |
| **Moderate** | Limited merging (header only or horizontal only), partial `\cline` | 10,647 |
| **Complex** | Multi-dimensional merging, nested structures, irregular layouts | 5,349 |
Classification was performed using GPT-5-mini based on structural analysis of the LaTeX source.
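
The labels themselves come from GPT-5-mini, so the snippet below is not the classifier used for this dataset. It is only a rough sketch for spot-checking labels: it counts the merging and partial-rule commands that the criteria above mention. The pattern set and the name `merge_features` are assumptions made for illustration.

```python
import re

# Commands related to the criteria in the table above (an illustrative, incomplete set).
MERGE_PATTERNS = {
    "multicolumn": re.compile(r"\\multicolumn\b"),          # horizontal merging
    "multirow": re.compile(r"\\multirow\b"),                # vertical merging
    "partial_rule": re.compile(r"\\(?:cline|cmidrule)\b"),  # partial horizontal lines
}


def merge_features(tabular: str) -> dict[str, int]:
    """Count merging-related commands in a LaTeX tabular source string."""
    return {name: len(p.findall(tabular)) for name, p in MERGE_PATTERNS.items()}


# Invented example table with one merged header cell and one partial rule.
example = (
    r"\begin{tabular}{lcc}\toprule & \multicolumn{2}{c}{Score} \\ "
    r"\cmidrule{2-3} Model & A & B \\ \midrule X & 1 & 2 \\ "
    r"\bottomrule\end{tabular}"
)
print(merge_features(example))
```

A table labelled `simple` that reports non-zero `multicolumn` or `multirow` counts would be a natural candidate for manual review.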
## Extraction Pipeline
1. **Source**: All arXiv papers published in December 2025 with redistributable Creative Commons licenses
2. **Extraction**: `tabular` and `tabular*` environments extracted from LaTeX source files
3. **Cleaning**: Citation and cross-reference commands (`\cite`, `\ref`, etc.) and captions removed; `\href` simplified to its link text
4. **Validation**: Each table compiled with `pdflatex` using standard packages (booktabs, multirow, makecell, etc.)
5. **Measurement**: Page dimensions measured with `pdfinfo` to obtain width/height in points
6. **Classification**: Structural complexity classified by GPT-5-mini
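
Steps 2 and 3 can be approximated with regular expressions. The sketch below is not the authors' extraction code: the patterns and the names `extract_tabulars` and `clean_tabular` are assumptions, and the non-greedy match does not handle a `tabular` nested inside another `tabular`.

```python
import re

# \begin{tabular}...\end{tabular} or the starred variant; the backreference
# makes the closing environment name match the opening one.
TABULAR_RE = re.compile(r"\\begin\{(tabular\*?)\}.*?\\end\{\1\}", re.DOTALL)

# \cite, \citep, \citet, \ref, \eqref, with optional [..] arguments and a
# preceding non-breaking space (~).
CITE_REF_RE = re.compile(
    r"~?\\(?:cite[a-zA-Z]*|ref|eqref)\*?(?:\[[^\]]*\])*\{[^{}]*\}"
)

# \href{url}{text}; group 1 captures the link text.
HREF_RE = re.compile(r"\\href\{[^{}]*\}\{([^{}]*)\}")


def extract_tabulars(tex: str) -> list[str]:
    """Return every tabular / tabular* environment found in a LaTeX source string."""
    return [m.group(0) for m in TABULAR_RE.finditer(tex)]


def clean_tabular(tabular: str) -> str:
    """Drop citation and reference commands and reduce \\href to its link text."""
    tabular = CITE_REF_RE.sub("", tabular)
    return HREF_RE.sub(r"\1", tabular)


# Invented LaTeX source for illustration.
sample_tex = r"""
Intro text \cite{a}.
\begin{table}
\caption{Results}
\begin{tabular}{ll}
Method & Source \\
Ours~\cite{horn2026} & \href{https://example.org}{project page} \\
\end{tabular}
\end{table}
\begin{tabular*}{\textwidth}{lr} a & b \\ \end{tabular*}
"""

tables = [clean_tabular(t) for t in extract_tabulars(sample_tex)]
print(len(tables))
print(tables[0])
```

Because only the `tabular` environment is captured, the surrounding `\caption` never enters the extracted string in this sketch.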
### Compilation Template
```latex
\documentclass[varwidth]{standalone}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{booktabs,multirow,makecell,graphicx,array}
\usepackage{amsmath,amssymb}
\usepackage{colortbl}
\usepackage[table]{xcolor}
\usepackage{adjustbox,caption,diagbox}
\usepackage{pifont}
\begin{document}
% tabular environment inserted here
\end{document}
```
Tables that fail to compile with this template are excluded from the dataset.
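The validation and measurement steps can be sketched as follows, assuming `pdflatex` and `pdfinfo` (from Poppler) are on the `PATH`. The function names, the 60-second timeout, and the choice of command-line flags are assumptions for illustration, not details taken from the authors' pipeline.

```python
import re
import subprocess
import tempfile
from pathlib import Path

# Preamble copied from the compilation template above.
PREAMBLE = r"""\documentclass[varwidth]{standalone}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{booktabs,multirow,makecell,graphicx,array}
\usepackage{amsmath,amssymb}
\usepackage{colortbl}
\usepackage[table]{xcolor}
\usepackage{adjustbox,caption,diagbox}
\usepackage{pifont}
\begin{document}
"""

# pdfinfo prints a line such as "Page size:      612 x 792 pts".
PAGE_SIZE_RE = re.compile(r"Page size:\s+([\d.]+) x ([\d.]+) pts")


def build_document(tabular: str) -> str:
    """Wrap a tabular environment in the standalone template."""
    return PREAMBLE + tabular.strip() + "\n\\end{document}\n"


def parse_page_size(pdfinfo_output: str) -> tuple[float, float]:
    """Extract (width_pt, height_pt) from pdfinfo's text output."""
    m = PAGE_SIZE_RE.search(pdfinfo_output)
    if m is None:
        raise ValueError("no 'Page size' line in pdfinfo output")
    return float(m.group(1)), float(m.group(2))


def compile_and_measure(tabular: str, timeout_s: float = 60.0):
    """Return (width_pt, height_pt), or None if the table fails to compile."""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "table.tex").write_text(build_document(tabular), encoding="utf-8")
        try:
            subprocess.run(
                ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "table.tex"],
                cwd=tmp, check=True, capture_output=True, timeout=timeout_s,
            )
            info = subprocess.run(
                ["pdfinfo", "table.pdf"],
                cwd=tmp, check=True, capture_output=True, text=True,
            )
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            return None  # treated as "does not compile"
        return parse_page_size(info.stdout)
```

Calling `compile_and_measure(sample['tabular'])` on a dataset record should roughly reproduce its `width_pt` and `height_pt`, although exact values may vary with the installed TeX distribution and fonts.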
## Use Cases
- **Table extraction evaluation**: Benchmark document parsers on LaTeX table recovery
- **Synthetic PDF generation**: Render tables into PDFs with known ground truth
- **Table structure recognition**: Train models to understand table layouts
- **Complexity-aware evaluation**: Evaluate models separately on simple vs. complex tables
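
For complexity-aware evaluation, per-table scores can be averaged within each class. The sketch below assumes some parser metric has already produced one score per table; `score_by_complexity` and the toy scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def score_by_complexity(results):
    """Map (complexity, score) pairs to {complexity: mean score}."""
    buckets = defaultdict(list)
    for complexity, score in results:
        buckets[complexity].append(score)
    return {c: mean(scores) for c, scores in buckets.items()}


# Invented scores in [0, 1]; a real run would pair each table's
# `complexity` label with the score a parser achieved on that table.
toy_results = [
    ("simple", 1.0),
    ("simple", 0.5),
    ("moderate", 0.75),
    ("complex", 0.25),
]
print(score_by_complexity(toy_results))
```

Reporting the three means separately avoids letting the large simple class dominate a single overall number.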
## Citation
If you use this dataset in your research or project, please cite our paper:
```bibtex
@misc{horn2026benchmarking,
  title         = {Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation},
  author        = {Horn, Pius and Keuper, Janis},
  year          = {2026},
  eprint        = {2603.18652},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.18652}
}
```
📄 **Paper:** [arXiv:2603.18652](https://arxiv.org/abs/2603.18652)
### Acknowledgments
This work has been supported by the German Federal Ministry of Research, Technology and Space (BMFTR) in the program "Forschung an Fachhochschulen in Kooperation mit Unternehmen (FH-Kooperativ)" within the joint project **LLMpraxis** under grant 13FH622KX2.
<p align="center">
<img src="logos/BMFTR_logo.png" alt="BMFTR_logo" width="150" />
<img src="logos/HAW_logo.png" alt="HAW_logo" width="150" />
</p>
### Licensing Information
**Content License**: Individual tables retain their original arXiv paper licenses (CC BY 4.0, CC BY-SA 4.0, CC BY-NC-SA 4.0, or CC0 1.0).
**Dataset License**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)
---
**Source**: arXiv papers, December 2025
**Dataset Created**: March 2026
Provider: piushorn



