ds4sd/DocLayNet-v1.1

Name: ds4sd/DocLayNet-v1.1
Creator: ds4sd
Published: 2023-09-01 09:58:52
License: no description provided

Source: Hugging Face; updated 2023-09-01; indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/ds4sd/DocLayNet-v1.1

Resource description:

---
annotations_creators:
- crowdsourced
license: other
pretty_name: DocLayNet
size_categories:
- 10K<n<100K
tags:
- layout-segmentation
- COCO
- document-understanding
- PDF
task_categories:
- object-detection
- image-segmentation
task_ids:
- instance-segmentation
dataset_info:
  features:
  - name: image
    dtype: image
  - name: bboxes
    sequence:
      sequence: float64
  - name: category_id
    sequence: int64
  - name: segmentation
    sequence:
      sequence:
        sequence: float64
  - name: area
    sequence: float64
  - name: pdf_cells
    list:
      list:
      - name: bbox
        sequence: float64
      - name: font
        struct:
        - name: color
          sequence: int64
        - name: name
          dtype: string
        - name: size
          dtype: float64
      - name: text
        dtype: string
  - name: metadata
    struct:
    - name: coco_height
      dtype: int64
    - name: coco_width
      dtype: int64
    - name: collection
      dtype: string
    - name: doc_category
      dtype: string
    - name: image_id
      dtype: int64
    - name: num_pages
      dtype: int64
    - name: original_filename
      dtype: string
    - name: original_height
      dtype: float64
    - name: original_width
      dtype: float64
    - name: page_hash
      dtype: string
    - name: page_no
      dtype: int64
  splits:
  - name: train
    num_bytes: 28172005254.125
    num_examples: 69375
  - name: test
    num_bytes: 1996179229.125
    num_examples: 4999
  - name: val
    num_bytes: 2493896901.875
    num_examples: 6489
  download_size: 7766115331
  dataset_size: 32662081385.125
---

# Dataset Card for DocLayNet v1.1

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043

### Dataset Summary

DocLayNet provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold standard in layout segmentation through human recognition and interpretation of each page layout.
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals.
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models.
5. *Pre-defined train, test and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and avoid leakage of unique layout styles across the sets.

## Dataset Structure

This dataset is structured differently from the other repository [ds4sd/DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet), as this one includes the content (PDF cells) of the detections, and abandons the COCO format.

* `image`: page PIL image.
* `bboxes`: a list of layout bounding boxes.
* `category_id`: a list of class ids corresponding to the bounding boxes.
* `segmentation`: a list of layout segmentation polygons.
* `pdf_cells`: a list of lists corresponding to `bboxes`. Each list contains the PDF cells (content) inside the bbox.
* `metadata`: page and document metadata details.

Bounding box classes / categories:

```
1: Caption
2: Footnote
3: Formula
4: List-item
5: Page-footer
6: Page-header
7: Picture
8: Section-header
9: Table
10: Text
11: Title
```

The `["metadata"]["doc_category"]` field uses one of the following constants:

```
* financial_reports,
* scientific_articles,
* laws_and_regulations,
* government_tenders,
* manuals,
* patents
```

### Data Splits

The dataset provides three splits:

- `train`
- `val`
- `test`

## Dataset Creation

### Annotations

#### Annotation process

The labeling guideline used to train the annotation experts is available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.

## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```

Provided by:

ds4sd

Summary of Original Information

Dataset Overview

Basic Information

Name: DocLayNet
License: other
Tags: layout-segmentation, COCO, document-understanding, PDF
Task categories: object-detection, image-segmentation
Task IDs: instance-segmentation
Size category: 10K<n<100K

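Dataset Structure

Data Fields

image: page image
bboxes: list of bounding boxes
category_id: list of class IDs
segmentation: list of segmentation polygons
area: list of region areas
pdf_cells: list of PDF cells
- bbox: bounding box
- font: font information
  - color: color
  - name: font name
  - size: font size
- text: text content
metadata: metadata
- coco_height: COCO height
- coco_width: COCO width
- collection: collection
- doc_category: document category
- image_id: image ID
- num_pages: number of pages
- original_filename: original file name
- original_height: original height
- original_width: original width
- page_hash: page hash
- page_no: page number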
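
The field layout above can be illustrated with a small consistency check. This is a minimal sketch, not part of the dataset tooling: `check_page_record`, `box_texts`, and `example_record` are hypothetical names, the record values are made up, and the assumption that `segmentation` and `area` hold one entry per bounding box (as the card states for `category_id` and `pdf_cells`) is inferred from the schema rather than confirmed by it.

```python
from typing import Any, List, Mapping


def check_page_record(record: Mapping[str, Any]) -> int:
    """Return the number of layout boxes after checking that per-box lists align.

    Assumption: every per-box field has exactly one entry per bounding box.
    """
    n_boxes = len(record["bboxes"])
    for key in ("category_id", "segmentation", "area", "pdf_cells"):
        if len(record[key]) != n_boxes:
            raise ValueError(
                f"{key} has {len(record[key])} entries, expected {n_boxes}"
            )
    return n_boxes


def box_texts(record: Mapping[str, Any]) -> List[str]:
    """Join the text of the PDF cells listed for each layout box."""
    return [
        " ".join(cell["text"] for cell in cells)
        for cells in record["pdf_cells"]
    ]


# Hypothetical record with invented values, shaped like the schema above.
font = {"color": [0, 0, 0, 255], "name": "Helvetica", "size": 12.0}
example_record = {
    "bboxes": [[10.0, 20.0, 200.0, 30.0], [10.0, 60.0, 200.0, 120.0]],
    "category_id": [11, 7],
    "segmentation": [[[10.0, 20.0, 210.0, 20.0, 210.0, 50.0, 10.0, 50.0]], [[]]],
    "area": [6000.0, 24000.0],
    "pdf_cells": [
        [
            {"bbox": [10.0, 20.0, 90.0, 30.0], "font": font, "text": "Annual"},
            {"bbox": [100.0, 20.0, 90.0, 30.0], "font": font, "text": "Report"},
        ],
        [],  # a box (for example a picture) with no PDF text cells
    ],
    "metadata": {"doc_category": "financial_reports", "page_no": 1},
}

print(check_page_record(example_record))
print(box_texts(example_record))
```

Checking alignment once, before training or analysis code indexes the lists in parallel, turns a silent mismatch into an explicit error.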

Data Splits

train: 69375 examples, 28172005254.125 bytes
test: 4999 examples, 1996179229.125 bytes
val: 6489 examples, 2493896901.875 bytes
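
As a quick sanity check, the split sizes can be summed and compared with the totals quoted elsewhere in the card (80863 pages and a `dataset_size` of 32662081385.125 bytes). The snippet below is only an arithmetic sketch using the numbers listed above; the variable names are illustrative.

```python
# (num_examples, num_bytes) per split, copied from the listing above.
SPLITS = {
    "train": (69375, 28172005254.125),
    "test": (4999, 1996179229.125),
    "val": (6489, 2493896901.875),
}

total_examples = sum(n for n, _ in SPLITS.values())
total_bytes = sum(b for _, b in SPLITS.values())
shares = {name: n / total_examples for name, (n, _) in SPLITS.items()}

print(f"total examples: {total_examples}")
print(f"total bytes: {total_bytes}")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of pages")
```

The example counts add up to 80863, matching the page count in the dataset summary, and the split is roughly 85.8% train, 6.2% test, and 8.0% val.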

Dataset Creation

Annotation Process

Annotators: crowdsourced
Labeling guide: DocLayNet_Labeling_Guide_Public.pdf

Additional Information

Dataset Curators

Team: Deep Search team at IBM Research
Contact email: deepsearch-core@zurich.ibm.com
Members:
- Christoph Auer, @cau-git
- Michele Dolfi, @dolfim-ibm
- Ahmed Nassar, @nassarofficial
- Peter Staar, @PeterStaar-IBM

Licensing Information

License: CDLA-Permissive-1.0

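Curated Summary

Dataset Introduction

Construction

In document layout analysis, high-quality annotated data is key to improving model performance. DocLayNet v1.1 was annotated through crowdsourcing by trained human annotators, with the aim of precise and consistent labels. The dataset draws 80863 unique pages from six public document categories (financial reports, scientific articles, laws and regulations, government tenders, manuals, and patents), covering a wide range of layouts. Annotators followed a detailed labeling guide, and a fraction of the pages were double- or triple-annotated, which makes it possible to quantify annotation uncertainty and gives an upper bound on the accuracy a machine learning model can be expected to reach.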

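Features

DocLayNet v1.1 provides human-annotated ground truth for layout segmentation with 11 class labels, such as Title, Text, Table, Picture, and Formula, that distinguish layout elements in fine detail. Beyond bounding boxes and instance-segmentation polygons, it also includes the PDF cell content inside each box, linking visual layout to text content. The predefined training, validation, and test sets preserve proportional representation of the class labels and prevent unique layout styles from leaking across sets, which gives a fair basis for model evaluation.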
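
To make the label set concrete, the sketch below maps the numeric class ids from the dataset card to their names and counts the labels on one page. `CATEGORY_NAMES` copies the id-to-name list from the card; `label_histogram` and the sample ids are hypothetical and only for illustration.

```python
from collections import Counter
from typing import Iterable

# Class ids and names as listed in the dataset card.
CATEGORY_NAMES = {
    1: "Caption",
    2: "Footnote",
    3: "Formula",
    4: "List-item",
    5: "Page-footer",
    6: "Page-header",
    7: "Picture",
    8: "Section-header",
    9: "Table",
    10: "Text",
    11: "Title",
}


def label_histogram(category_ids: Iterable[int]) -> Counter:
    """Count how often each class name occurs in one page's `category_id` list."""
    return Counter(CATEGORY_NAMES[class_id] for class_id in category_ids)


# Invented ids for a single page: two text blocks, a table, a section header.
page_category_ids = [10, 10, 9, 8]
print(label_histogram(page_category_ids))
```

An unknown id raises a `KeyError`, which is usually preferable to silently counting a label the card does not define.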

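Usage

The dataset mainly serves computer vision tasks such as document layout segmentation and object detection. It can be loaded directly from Hugging Face, and each record contains the page image, bounding boxes, category IDs, segmentation polygons, PDF cells, and metadata. Note that this version abandons the COCO format in favor of this per-page field layout. The document category stored in `metadata` can be used to select subsets for targeted analysis or training. The dataset is already divided into training, validation, and test splits, so it can be used directly for model training, hyperparameter tuning, and evaluation.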
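
The loading and filtering steps described above could look roughly like the following. This is a hedged sketch: it assumes the Hugging Face `datasets` package is installed and that the repository loads with a plain `load_dataset` call; `load_split` and `pages_in_category` are hypothetical helper names, not part of any library. The repository id, split names, and `doc_category` constants come from the dataset card.

```python
from itertools import islice
from typing import Any, Iterable, List, Mapping, Optional

# Constants listed in the dataset card for metadata["doc_category"].
DOC_CATEGORIES = {
    "financial_reports",
    "scientific_articles",
    "laws_and_regulations",
    "government_tenders",
    "manuals",
    "patents",
}


def load_split(split: str = "val", streaming: bool = True):
    """Load one split ("train", "val" or "test") from the Hugging Face Hub.

    Streaming avoids downloading the full dataset up front. The import is
    local so the rest of this sketch runs without `datasets` installed.
    """
    from datasets import load_dataset

    return load_dataset("ds4sd/DocLayNet-v1.1", split=split, streaming=streaming)


def pages_in_category(
    records: Iterable[Mapping[str, Any]],
    doc_category: str,
    limit: Optional[int] = None,
) -> List[Mapping[str, Any]]:
    """Collect up to `limit` pages whose metadata matches `doc_category`."""
    if doc_category not in DOC_CATEGORIES:
        raise ValueError(f"unknown doc_category: {doc_category!r}")
    matches = (
        record
        for record in records
        if record["metadata"]["doc_category"] == doc_category
    )
    return list(islice(matches, limit))


# Example (requires network access and the `datasets` package):
# patent_pages = pages_in_category(load_split("val"), "patents", limit=5)
```

Because `pages_in_category` accepts any iterable of records, the same helper works on a streamed split, a fully downloaded one, or a list of test fixtures.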

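Background and Challenges

Background Overview

In document intelligence, layout segmentation is a key technique for understanding complex document structure: it identifies and localizes the visual elements on a page. DocLayNet was created in 2022 by the Deep Search team at IBM Research to provide high-quality, human-annotated benchmark data for multi-class document layout segmentation. The dataset covers six document categories, including financial reports, scientific articles, and laws and regulations, contains 80863 pages, and defines 11 fine-grained class labels such as Title, Table, and Formula. Its redundant annotations and fixed data splits make model training and evaluation more reliable and support the use of document understanding methods on real-world documents.

Current Challenges

DocLayNet targets a core difficulty of layout segmentation: accurately parsing pages with highly diverse and complex structure, for example handling tables nested within text or telling page headers from page footers. Building the dataset posed several difficulties. First, annotation relies on trained annotators interpreting complex layouts consistently, which is costly in time and labor. Second, the documents come from many sources with very different layout styles, so the labeling guideline must generalize well enough to cover edge cases. Third, the distribution of document categories and labels has to be balanced so that data bias does not hurt model generalization.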

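Common Use Cases

Typical Use Case

In document intelligence research, DocLayNet serves as a standardized evaluation benchmark for layout segmentation. Its human-annotated bounding boxes and segmentation polygons mark 11 classes of layout elements on each page, such as titles, tables, and pictures, and support training and validating instance segmentation and object detection models. Because it spans six document categories, including financial reports, scientific articles, and legal texts, it tests how well models generalize across complex and varied layouts, and it is widely used as a benchmark in document layout analysis.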

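Derived and Related Work

DocLayNet's annotations have been used in a range of follow-up research. For example, pretrained models such as LayoutLMv3 have reportedly used it as multimodal training data to improve document understanding performance, and end-to-end segmentation networks such as DocSegTr have reportedly used its fine-grained annotations to improve layout segmentation accuracy. The dataset has also prompted domain adaptation methods aimed at specific document categories, such as patents or tender documents.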

Recent Research on the Dataset