PubTables-1M (PubMed Tables One Million)
OpenDataLab | Updated 2026-05-24 | Indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/PubTables-1M
Description:
PubTables-1M aims to provide a large, detailed, high-quality dataset for training and evaluating models for table detection, table structure recognition, and functional analysis. It contains 460,589 annotated document pages containing tables, for table detection, and 947,642 fully annotated tables, including text content and complete location (bounding box) information, for table structure recognition and functional analysis.
Full bounding boxes, in both image and PDF coordinates, are provided for all table rows, columns, and cells (including blank cells), as well as for other annotated structures such as column headers and projected row headers. The dataset also includes rendered images of all tables and pages, bounding boxes and text for every word appearing in each table and page image, and additional cell attributes not used in current model training. In addition, cells within headers are normalized, and multiple quality control steps were applied to keep the annotations as free of noise as possible. For more details, see the PubTables-1M paper.
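As a rough illustration of how the bounding-box annotations described above might be consumed, the sketch below reads labeled boxes from a PASCAL VOC-style XML string. The XML layout, the class names ("table row", "table column"), and the helper `read_boxes` are assumptions made for illustration; this listing does not specify the annotation file format, so check the downloaded files before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample annotation; the real file layout is an assumption
# and should be verified against the downloaded dataset.
SAMPLE_XML = """<annotation>
  <filename>example_table.jpg</filename>
  <object>
    <name>table row</name>
    <bndbox><xmin>10.0</xmin><ymin>20.0</ymin><xmax>300.0</xmax><ymax>45.5</ymax></bndbox>
  </object>
  <object>
    <name>table column</name>
    <bndbox><xmin>10.0</xmin><ymin>20.0</ymin><xmax>120.0</xmax><ymax>200.0</ymax></bndbox>
  </object>
</annotation>"""


def read_boxes(xml_text):
    """Return a list of (label, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates are parsed as floats in image-pixel units (assumed).
        coords = tuple(
            float(bndbox.findtext(key))
            for key in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, coords))
    return boxes


if __name__ == "__main__":
    for label, box in read_boxes(SAMPLE_XML):
        print(label, box)
```

Returning plain tuples keeps the sketch independent of any training framework; a real pipeline would likely map the labels to class indices and convert the boxes to whatever tensor format its detector expects.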
Provided by:
OpenDataLab
Created:
2022-08-16
Collected Summary
Dataset Introduction

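Background and Challenges
Background Overview
PubTables-1M is a large-scale, high-quality table extraction dataset designed for training and evaluating models for table detection, structure recognition, and functional analysis. It contains roughly 460,000 document pages and 947,000 fully annotated tables, with detailed annotations including bounding boxes, text content, and cell attributes, and it has undergone strict quality control to keep noise low. The dataset was released by Microsoft in 2021 and aims to advance research on comprehensive table extraction from unstructured documents.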
The summary above was collected and generated by 遇见数据集.



