
piushorn/arxiv-latex-tables-43k

Hugging Face | Updated 2026-03-20 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/piushorn/arxiv-latex-tables-43k
Resource description:
---
language:
- en
license: cc-by-nc-sa-4.0
task_categories:
- table-question-answering
- document-question-answering
tags:
- latex
- tables
- arxiv
- scientific-documents
- table-extraction
- document-understanding
pretty_name: arXiv LaTeX Tables 43k
size_categories:
- 10K<n<100K
configs:
- config_name: default
  data_files: "*.parquet"
---

# arXiv LaTeX Tables 43k

A curated collection of **~43k LaTeX tables** extracted from all arXiv papers published in December 2025, classified by structural complexity for training and evaluating table extraction models.

## Key Specifications

| Aspect | Details |
|--------|---------|
| **Size** | 43,651 tables |
| **Source** | All arXiv papers published in December 2025 |
| **Complexity Classes** | Simple (27,655), Moderate (10,647), Complex (5,349) |
| **Compilability** | All tables compile with `pdflatex` |
| **License Filter** | Only redistributable CC licenses |
| **Format** | Hugging Face Dataset (Parquet) |

## Dataset Structure

```python
{
    "table_id": str,      # Unique ID: "{arxiv_id}_table_{n}"
    "width_pt": float,    # Rendered table width in points
    "height_pt": float,   # Rendered table height in points
    "tabular": str,       # LaTeX tabular source code
    "complexity": str,    # "simple", "moderate", or "complex"
}
```

## Quick Start

```python
from datasets import load_dataset

ds = load_dataset("piushorn/arxiv-latex-tables-43k")
sample = ds['train'][0]
print(sample['tabular'])      # LaTeX source
print(sample['complexity'])   # e.g., "simple"
```

Efficient column-selective loading via DuckDB (fetches only needed columns from Parquet):

```python
import duckdb

parquet_url = (
    "https://huggingface.co/datasets/piushorn/arxiv-latex-tables-43k/"
    "resolve/main/data/train-00000-of-00001.parquet"
)
con = duckdb.connect()
rows = con.execute(f"SELECT tabular, complexity FROM read_parquet('{parquet_url}')").fetchall()
```

## Complexity Classification

Tables are classified into three structural complexity levels based on how well their layout maps to Markdown:

| Level | Criteria | Count |
|-------|----------|-------|
| **Simple** | Regular grid, no cell merging, standard lines | 27,655 |
| **Moderate** | Limited merging (header only or horizontal only), partial `\cline` | 10,647 |
| **Complex** | Multi-dimensional merging, nested structures, irregular layouts | 5,349 |

Classification was performed using GPT-5-mini based on structural analysis of the LaTeX source.

## Extraction Pipeline

1. **Source**: All arXiv papers published in December 2025 with redistributable Creative Commons licenses
2. **Extraction**: `tabular` and `tabular*` environments extracted from LaTeX source files
3. **Cleaning**: Citations (`\cite`, `\ref`, etc.) and captions removed; `\href` simplified to link text
4. **Validation**: Each table compiled with `pdflatex` using standard packages (booktabs, multirow, makecell, etc.)
5. **Measurement**: Page dimensions measured with `pdfinfo` to obtain width/height in points
6. **Classification**: Structural complexity classified by GPT-5-mini

### Compilation Template

```latex
\documentclass[varwidth]{standalone}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{booktabs,multirow,makecell,graphicx,array}
\usepackage{amsmath,amssymb}
\usepackage{colortbl}
\usepackage[table]{xcolor}
\usepackage{adjustbox,caption,diagbox}
\usepackage{pifont}
\begin{document}
% tabular environment inserted here
\end{document}
```

Tables that fail to compile with this template are excluded from the dataset.

## Use Cases

- **Table extraction evaluation**: Benchmark document parsers on LaTeX table recovery
- **Synthetic PDF generation**: Render tables into PDFs with known ground truth
- **Table structure recognition**: Train models to understand table layouts
- **Complexity-aware evaluation**: Evaluate models separately on simple vs. complex tables

## Citation

If you use this dataset in your research or project, please cite our paper:

```bibtex
@misc{horn2026benchmarking,
  title = {Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation},
  author = {Horn, Pius and Keuper, Janis},
  year = {2026},
  eprint = {2603.18652},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2603.18652}
}
```

📄 **Paper:** [arXiv:2603.18652](https://arxiv.org/abs/2603.18652)

### Acknowledgments

This work has been supported by the German Federal Ministry of Research, Technology and Space (BMFTR) in the program "Forschung an Fachhochschulen in Kooperation mit Unternehmen (FH-Kooperativ)" within the joint project **LLMpraxis** under grant 13FH622KX2.

<p align="center">
  <img src="logos/BMFTR_logo.png" alt="BMFTR_logo" width="150" />
  <img src="logos/HAW_logo.png" alt="HAW_logo" width="150" />
</p>

### Licensing Information

**Content License**: Individual tables retain their original arXiv paper licenses (CC BY 4.0, CC BY-SA 4.0, CC BY-NC-SA 4.0, or CC0 1.0).

**Dataset License**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

---

**Source**: arXiv papers, December 2025

**Dataset Created**: March 2026
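
The record schema, the compilation template, and the complexity-aware use case described in the dataset card can be tied together in a few lines. The following is a minimal sketch, not the authors' pipeline: the helper names (`wrap_tabular`, `split_table_id`, `group_by_complexity`), the `<<TABULAR>>` placeholder, and the sample arXiv ID are hypothetical, and the sketch assumes that `{n}` in `table_id` is a non-negative integer, which the card does not state explicitly.

```python
import re
from collections import defaultdict

# Preamble copied from the compilation template in the dataset card.
# The <<TABULAR>> marker is a hypothetical placeholder used only in this sketch.
TEMPLATE = r"""\documentclass[varwidth]{standalone}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{booktabs,multirow,makecell,graphicx,array}
\usepackage{amsmath,amssymb}
\usepackage{colortbl}
\usepackage[table]{xcolor}
\usepackage{adjustbox,caption,diagbox}
\usepackage{pifont}
\begin{document}
<<TABULAR>>
\end{document}
"""


def wrap_tabular(tabular: str) -> str:
    """Insert a `tabular` source string into the standalone template."""
    return TEMPLATE.replace("<<TABULAR>>", tabular)


def split_table_id(table_id: str):
    """Split "{arxiv_id}_table_{n}" into (arxiv_id, n).

    Assumes n is an integer; returns None if the ID does not match.
    """
    match = re.fullmatch(r"(.+)_table_(\d+)", table_id)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def group_by_complexity(rows):
    """Group record dicts by their "complexity" field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["complexity"]].append(row)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical record shaped like the documented schema.
    row = {
        "table_id": "2512.00001_table_3",
        "width_pt": 120.0,
        "height_pt": 40.0,
        "tabular": r"\begin{tabular}{ll} a & b \\ \end{tabular}",
        "complexity": "simple",
    }
    print(split_table_id(row["table_id"]))
    print(wrap_tabular(row["tabular"]))
```

The resulting document string can be written to a `.tex` file and compiled with `pdflatex` to render a table with known ground truth, and the grouped records make it straightforward to report metrics per complexity level.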

Provider:
piushorn