rohanSingh969/PubTables-v2
Source: Hugging Face. Updated 2026-04-13; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/rohanSingh969/PubTables-v2
Resource description:
---
license: cdla-permissive-2.0
task_categories:
- image-to-text
- object-detection
tags:
- table-extraction
- document-understanding
---
# PubTables-v2
*PubTables-v2* is a new large-scale dataset for full-page and multi-page table extraction.
<img src="https://cdn-uploads.huggingface.co/production/uploads/631b793ff6bc4be4a65092ed/w9BXjqpGduqiPZ-QZCfYO.jpeg" alt="PubTables-v2 Figure 2" width="800">
See also: [Hugging Face Paper Page](https://huggingface.co/papers/2512.10888)
## News
- `2026 Mar 17`: New [paper](https://arxiv.org/abs/2512.10888) draft with more experiments, especially for multi-page table extraction
- `2026 Feb 11`: PubTables-v2 has been officially released on Hugging Face!
- `2025 Dec 11`: Our [paper](https://arxiv.org/abs/2512.10888) is now available on arXiv
## Collections
PubTables-v2 comes in 3 collections.
Each collection contains tables in a specific context: cropped tables (essentially context-free), tables within a full page context, and tables within a full document context.
### Cropped Tables
<img src="https://cdn-uploads.huggingface.co/production/uploads/631b793ff6bc4be4a65092ed/1ErEFtJ1e6Zx-HBSgkVob.jpeg" alt="PubTables-v2 Figure 3" width="600">
- **135,578** cropped tables, with each sample containing exactly one table cropped closely to its border
- In contrast with PubTables-1M, all of the cropped tables in this collection are either long (30 or more rows) or wide (12 or more columns)
- The data in this collection supports the traditional table structure recognition (TSR) task
- Note: **5,804** of these tables are currently part of an unreleased, private test set
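
The long-or-wide criterion above can be written as a simple predicate. This is a minimal sketch, not the authors' selection code: the function name and the sample shapes are hypothetical, and the card does not say how rows and columns are counted (for example, whether header rows are included).

```python
# Thresholds quoted from the dataset card.
MIN_ROWS_LONG = 30
MIN_COLS_WIDE = 12


def is_long_or_wide(num_rows: int, num_cols: int) -> bool:
    """Return True if a table meets the card's 'long' or 'wide' criterion."""
    return num_rows >= MIN_ROWS_LONG or num_cols >= MIN_COLS_WIDE


# Hypothetical (rows, columns) pairs, not taken from the dataset.
examples = [(45, 6), (8, 14), (12, 5)]
selected = [shape for shape in examples if is_long_or_wide(*shape)]
print(selected)  # [(45, 6), (8, 14)]
```

A table qualifies if either threshold is met, so a 45-row, 6-column table and an 8-row, 14-column table both pass, while a 12-row, 5-column table does not.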
### Single Pages
<img src="https://cdn-uploads.huggingface.co/production/uploads/631b793ff6bc4be4a65092ed/me7ogRAehkyz_R5zE5vIi.jpeg" alt="PubTables-v2 Figure 1" width="300">
- **467,541** individual document pages, with each page containing 1 or more tables
- This collection contains **548,414** tables annotated in their full-page context
- The data in this collection supports table detection (TD) and end-to-end table extraction (TE), or TSR in a full-page context
- Note: **43,288** of these pages are currently part of an unreleased, private test set
### Full Documents
<img src="https://cdn-uploads.huggingface.co/production/uploads/631b793ff6bc4be4a65092ed/w9BXjqpGduqiPZ-QZCfYO.jpeg" alt="PubTables-v2 Figure 2" width="500">
- **9,172** full documents, with all pages included and all tables annotated
- This collection focuses specifically on multi-page tables: every document contains at least one multi-page table
- In total there are **9,492** multi-page tables, **630** single-page tables split across multiple columns, and **14,740** single-page, single-part tables
- The data in this collection supports the tasks of multi-page TD, multi-page TSR (cell structure and content but not location), and cross-page table continuation prediction
Note: A small percentage of the samples in PubTables-v2 are currently part of a private, unreleased test set.
These samples are held back so that future models can be checked for data leakage from the public test set, and they will be released at a later date.
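
The counts quoted for each collection can be combined to estimate how much data is publicly available right now. The sketch below is plain arithmetic on the numbers from this card. It assumes the private test samples are included in the stated totals, as the phrases "of these tables" and "of these pages" suggest. All variable names are illustrative.

```python
# Counts quoted from the collection descriptions above.
CROPPED_TABLES_TOTAL = 135_578
CROPPED_TABLES_PRIVATE = 5_804
PAGES_TOTAL = 467_541
PAGES_PRIVATE = 43_288
PAGE_TABLES_TOTAL = 548_414
DOC_TABLE_COUNTS = {
    "multi_page": 9_492,
    "single_page_multi_column": 630,
    "single_page_single_part": 14_740,
}

# Assumption: the private test samples are counted inside the totals.
public_cropped_tables = CROPPED_TABLES_TOTAL - CROPPED_TABLES_PRIVATE
public_pages = PAGES_TOTAL - PAGES_PRIVATE
tables_per_page = PAGE_TABLES_TOTAL / PAGES_TOTAL
document_tables_total = sum(DOC_TABLE_COUNTS.values())

print(f"Public cropped tables: {public_cropped_tables:,}")
print(f"Public single pages: {public_pages:,}")
print(f"Average tables per page: {tables_per_page:.3f}")
print(f"Tables in Full Documents: {document_tables_total:,}")
```

Under that assumption, 129,774 cropped tables and 424,253 pages are public, and the Full Documents collection contains 24,862 annotated tables in total.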
## Download
Currently we support downloading the dataset as a collection of tar.gz files.
Please switch to the "Files and versions" tab to download all of the files or use a command such as wget to download from the command line.
On your machine, make a directory called "PubTables-v2" (or whatever you prefer) to download the files into.
Once downloaded, use the included script "uncompress.sh" to extract and organize all of the data within that folder.
If you only want to uncompress files from a particular collection, such as the Single Pages collection, use the included script with that collection in the name.
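
The steps above can also be scripted. The following is a minimal sketch using only the Python standard library, under several assumptions:

- The repository ID is the one in the download link on this page.
- Files are served through the usual Hugging Face `resolve` URL layout.
- `uncompress.sh` has been downloaded into the target directory and can be run with `bash` from there.

The helper names are hypothetical. Archive file names are not listed on this card, so they must be copied from the "Files and versions" tab.

```python
import subprocess
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

# Repository ID taken from the download link on this page (an assumption).
REPO_ID = "rohanSingh969/PubTables-v2"


def resolve_url(filename: str, repo_id: str = REPO_ID, revision: str = "main") -> str:
    """Build a direct-download URL for one file in a Hugging Face dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{quote(filename)}"


def download_files(filenames, target_dir="PubTables-v2"):
    """Download each named file into target_dir and return the local paths."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in filenames:
        destination = target / Path(name).name
        if not destination.exists():  # skip files that are already present
            urlretrieve(resolve_url(name), destination)
        paths.append(destination)
    return paths


def run_uncompress(target_dir="PubTables-v2", script="uncompress.sh"):
    """Run the included extraction script from inside target_dir.

    Assumes the script is already in target_dir and is a bash script.
    """
    subprocess.run(["bash", script], cwd=target_dir, check=True)
```

To use it, pass the archive names copied from the "Files and versions" tab, together with `uncompress.sh`, to `download_files`, then call `run_uncompress()`. The archives are large, so `wget` or another tool that can resume interrupted transfers may be a better choice in practice.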
Provider: rohanSingh969