ds4sd/DocLayNet-v1.1
Source: Hugging Face · Updated: 2023-09-01 · Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/ds4sd/DocLayNet-v1.1
Resource description:
---
annotations_creators:
- crowdsourced
license: other
pretty_name: DocLayNet
size_categories:
- 10K<n<100K
tags:
- layout-segmentation
- COCO
- document-understanding
- PDF
task_categories:
- object-detection
- image-segmentation
task_ids:
- instance-segmentation
dataset_info:
  features:
  - name: image
    dtype: image
  - name: bboxes
    sequence:
      sequence: float64
  - name: category_id
    sequence: int64
  - name: segmentation
    sequence:
      sequence:
        sequence: float64
  - name: area
    sequence: float64
  - name: pdf_cells
    list:
      list:
      - name: bbox
        sequence: float64
      - name: font
        struct:
        - name: color
          sequence: int64
        - name: name
          dtype: string
        - name: size
          dtype: float64
      - name: text
        dtype: string
  - name: metadata
    struct:
    - name: coco_height
      dtype: int64
    - name: coco_width
      dtype: int64
    - name: collection
      dtype: string
    - name: doc_category
      dtype: string
    - name: image_id
      dtype: int64
    - name: num_pages
      dtype: int64
    - name: original_filename
      dtype: string
    - name: original_height
      dtype: float64
    - name: original_width
      dtype: float64
    - name: page_hash
      dtype: string
    - name: page_no
      dtype: int64
  splits:
  - name: train
    num_bytes: 28172005254.125
    num_examples: 69375
  - name: test
    num_bytes: 1996179229.125
    num_examples: 4999
  - name: val
    num_bytes: 2493896901.875
    num_examples: 6489
  download_size: 7766115331
  dataset_size: 32662081385.125
---
# Dataset Card for DocLayNet v1.1
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043
### Dataset Summary
DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:
1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold standard in layout segmentation through human recognition and interpretation of each page layout.
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals.
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models.
5. *Pre-defined train, test, and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and avoid leakage of unique layout styles across the sets.
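Redundant annotations make it possible to measure how closely two annotators agree on the same page. The following is a minimal sketch of one common building block for such a comparison, the intersection-over-union (IoU) of two boxes. It assumes boxes in `[x0, y0, x1, y1]` corner format, and the function name `box_iou` and the sample boxes are illustrative; this is not necessarily the agreement measure used by the DocLayNet authors.

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection-over-union of two boxes in [x0, y0, x1, y1] format (assumed)."""
    # Overlap rectangle; width and height clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two hypothetical annotations of the same layout element.
annotator_1 = [10.0, 10.0, 110.0, 60.0]
annotator_2 = [20.0, 10.0, 120.0, 60.0]
print(box_iou(annotator_1, annotator_2))
```

An IoU close to 1 means the two annotators drew nearly the same box; a value of 0 means the boxes do not overlap at all.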
## Dataset Structure
This dataset is structured differently from the other repository [ds4sd/DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet), as this one includes the content (PDF cells) of the detections, and abandons the COCO format.
* `image`: the page as a PIL image.
* `bboxes`: a list of layout bounding boxes.
* `category_id`: a list of class ids corresponding to the bounding boxes.
* `segmentation`: a list of layout segmentation polygons.
* `area`: a list of area values, one per bounding box.
* `pdf_cells`: a list of lists aligned with `bboxes`. Each inner list contains the PDF cells (content) inside the corresponding bounding box.
* `metadata`: page and document metadata.
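As a rough illustration of how these parallel lists fit together, the sketch below joins the text of the PDF cells that belong to each layout box. The record is hand-made to mimic the documented fields and is not real dataset content; the helper name `text_per_box` is hypothetical, and the code assumes that each PDF cell is a dict with a `text` key and that `pdf_cells[i]` lines up with `bboxes[i]` and `category_id[i]`.

```python
from typing import Any, Dict, List


def text_per_box(example: Dict[str, Any]) -> List[str]:
    """Join the text of the PDF cells inside each layout box, in list order."""
    return [" ".join(cell["text"] for cell in cells) for cells in example["pdf_cells"]]


# Hand-made record that mimics the documented fields (values are invented).
example = {
    "bboxes": [[50.0, 40.0, 400.0, 30.0], [50.0, 90.0, 400.0, 200.0]],
    "category_id": [11, 10],
    "pdf_cells": [
        [{"bbox": [50.0, 40.0, 400.0, 30.0], "text": "Annual Report"}],
        [
            {"bbox": [50.0, 90.0, 400.0, 20.0], "text": "Revenue grew"},
            {"bbox": [50.0, 110.0, 400.0, 20.0], "text": "in all segments."},
        ],
    ],
}

# One line per layout box: class id followed by the reconstructed text.
for category, text in zip(example["category_id"], text_per_box(example)):
    print(category, text)
```

With the `datasets` library installed, a real record could be obtained with `load_dataset("ds4sd/DocLayNet-v1.1", split="val")`; note that this downloads several gigabytes.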
Bounding box classes / categories:
```
1: Caption
2: Footnote
3: Formula
4: List-item
5: Page-footer
6: Page-header
7: Picture
8: Section-header
9: Table
10: Text
11: Title
```
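For convenience, the id-to-name table above can be written as a Python mapping. This is a small sketch; the names `CATEGORY_NAMES` and `label_names` are illustrative and not part of the dataset or any library.

```python
from typing import Dict, Iterable, List

# Class ids and names copied from the list above.
CATEGORY_NAMES: Dict[int, str] = {
    1: "Caption",
    2: "Footnote",
    3: "Formula",
    4: "List-item",
    5: "Page-footer",
    6: "Page-header",
    7: "Picture",
    8: "Section-header",
    9: "Table",
    10: "Text",
    11: "Title",
}


def label_names(category_ids: Iterable[int]) -> List[str]:
    """Translate a page's `category_id` list into readable class names."""
    return [CATEGORY_NAMES[i] for i in category_ids]


print(label_names([11, 10, 9]))
```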
The `["metadata"]["doc_category"]` field uses one of the following constants:
```
* financial_reports,
* scientific_articles,
* laws_and_regulations,
* government_tenders,
* manuals,
* patents
```
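These constants can be used to select pages of a single document type. The sketch below works on plain dictionaries shaped like the `metadata` field; the helper name `select_category` and the sample pages are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Constants copied from the list above.
DOC_CATEGORIES = frozenset({
    "financial_reports",
    "scientific_articles",
    "laws_and_regulations",
    "government_tenders",
    "manuals",
    "patents",
})


def select_category(pages: Iterable[Dict[str, Any]], category: str) -> List[Dict[str, Any]]:
    """Keep only the pages whose metadata names the given document category."""
    if category not in DOC_CATEGORIES:
        raise ValueError(f"unknown doc_category: {category!r}")
    return [p for p in pages if p["metadata"]["doc_category"] == category]


# Invented pages carrying only the metadata needed for the example.
pages = [
    {"metadata": {"doc_category": "patents", "page_no": 3}},
    {"metadata": {"doc_category": "manuals", "page_no": 1}},
    {"metadata": {"doc_category": "patents", "page_no": 7}},
]
patent_pages = select_category(pages, "patents")
print([p["metadata"]["page_no"] for p in patent_pages])
```

On a loaded `datasets.Dataset`, the same condition could be passed to `Dataset.filter`, for example `ds.filter(lambda ex: ex["metadata"]["doc_category"] == "patents")`.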
### Data Splits
The dataset provides three splits:
- `train`
- `val`
- `test`
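The example counts per split are listed in the `dataset_info` metadata of this card. A short sketch of the arithmetic shows that they add up to the 80863 pages mentioned in the summary and gives each split's share of the total; the variable names are illustrative.

```python
# Example counts taken from the `splits` entries in the card metadata.
SPLIT_SIZES = {"train": 69375, "val": 6489, "test": 4999}

total = sum(SPLIT_SIZES.values())
fractions = {name: n / total for name, n in SPLIT_SIZES.items()}

print(f"total pages: {total}")
for name, n in SPLIT_SIZES.items():
    print(f"{name}: {n} pages ({fractions[name]:.1%})")
```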
## Dataset Creation
### Annotations
#### Annotation process
The labeling guidelines used to train the annotation experts are available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).
#### Who are the annotators?
Annotations are crowdsourced.
## Additional Information
### Dataset Curators
The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
Curators:
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)
### Licensing Information
License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)
### Citation Information
```bib
@article{doclaynet2022,
title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
doi = {10.1145/3534678.3539043},
url = {https://doi.org/10.1145/3534678.3539043},
author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
year = {2022},
isbn = {9781450393850},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {3743–3751},
numpages = {9},
location = {Washington DC, USA},
series = {KDD '22}
}
```
Provided by:
ds4sd
Summary of Original Information
Dataset Overview
Basic Dataset Information
- Name: DocLayNet
- License: other
- Tags: layout-segmentation, COCO, document-understanding, PDF
- Task categories: object-detection, image-segmentation
- Task IDs: instance-segmentation
- Size category: 10K<n<100K
Dataset Structure
Data Fields
- image: page image
- bboxes: list of bounding boxes
- category_id: list of category IDs
- segmentation: list of segmentation polygons
- area: list of areas
- pdf_cells: list of PDF cells
  - bbox: bounding box
  - font: font information
    - color: color
    - name: font name
    - size: font size
  - text: text content
- metadata: metadata
  - coco_height: COCO height
  - coco_width: COCO width
  - collection: collection
  - doc_category: document category
  - image_id: image ID
  - num_pages: number of pages
  - original_filename: original file name
  - original_height: original height
  - original_width: original width
  - page_hash: page hash
  - page_no: page number
Data Splits
- train: 69375 examples, 28172005254.125 bytes
- test: 4999 examples, 1996179229.125 bytes
- val: 6489 examples, 2493896901.875 bytes
Dataset Creation
Annotation Process
- Annotators: crowdsourced
- Labeling guide: DocLayNet_Labeling_Guide_Public.pdf
Additional Information
Dataset Curators
- Team: Deep Search team at IBM Research
- Contact email: deepsearch-core@zurich.ibm.com
- Members:
  - Christoph Auer, @cau-git
  - Michele Dolfi, @dolfim-ibm
  - Ahmed Nassar, @nassarofficial
  - Peter Staar, @PeterStaar-IBM
Licensing Information
- License: CDLA-Permissive-1.0
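Collected Summary
Dataset Introduction

Construction
In document layout analysis, high-quality annotated data is key to improving model performance. DocLayNet v1.1 was annotated by trained experts working through crowdsourcing, which supports precise and consistent labels. The dataset draws 80863 unique pages from six public document categories (financial reports, scientific articles, laws and regulations, government tenders, manuals, and patents), covering a wide range of layout variability. Annotators followed a detailed labeling guide, and a portion of the pages was double- or triple-annotated to quantify annotation uncertainty and to provide a reference upper bound for evaluating machine learning models.
Features
DocLayNet v1.1 stands out among document understanding datasets. Its core strength is a human-annotated gold standard for layout segmentation with 11 finely defined class labels, such as Title, Text, Table, Picture, and Formula, which distinguish the layout elements of a document in detail. Besides bounding boxes and instance segmentation polygons, the dataset includes PDF cell content, tying the visual layout to the underlying text. Its predefined training, validation, and test sets ensure proportional representation of the class labels and prevent leakage of unique layout styles across sets, giving a fair benchmark for model evaluation.
Usage
The dataset mainly serves computer vision tasks such as document layout segmentation and object detection. It can be loaded directly from the Hugging Face platform and has a clear structure, with fields for the image, bounding boxes, category IDs, segmentation polygons, PDF cells, and metadata. Note that this version abandons the COCO format in favor of a more flexible field layout. The document category stored in `metadata` can be used for targeted subset analysis or training. The dataset is already divided into training, validation, and test parts, so it can be used directly for model training, hyperparameter tuning, and performance testing.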
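Background and Challenges
Background Overview
In document intelligence, layout segmentation is a key technique for understanding complex document structure; it aims to identify and locate the visual elements of a document precisely. DocLayNet was created in 2022 by the Deep Search team at IBM Research. Its core research question is how to provide benchmark data for multi-class document layout segmentation through high-quality human annotation. The dataset covers six document categories, including financial reports, scientific articles, and laws and regulations, contains 80863 pages, and defines 11 fine-grained label classes such as Title, Table, and Formula. By introducing redundant annotations and fixed data splits, DocLayNet improves the reliability of model training and supports the use of document understanding techniques in real-world scenarios, with impact on both academic research and industrial practice.
Current Challenges
DocLayNet addresses a core challenge in document layout segmentation: accurately parsing pages with highly diverse and complex structure, for example handling tables nested within text or telling apart subtle layout differences such as page headers and page footers. Building the dataset involved several difficulties. First, annotation depends on trained annotators recognizing layouts by hand and interpreting complex layouts consistently, which is costly in time and labor. Second, the documents come from many sources with markedly different layout styles, so the labeling guidelines must generalize well enough to cover edge cases. Third, the distribution of document categories and labels has to be balanced so that data bias does not hurt model generalization.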
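Common Use Cases
Typical Use Cases
In document intelligence research, DocLayNet provides a standardized evaluation benchmark for layout segmentation. Its human-annotated bounding boxes and segmentation polygons mark 11 classes of layout elements, such as titles, tables, and pictures, and support the training and validation of instance segmentation and object detection models. Because it covers six document categories, including financial reports, scientific articles, and legal texts, it tests how well models generalize across complex and varied layouts, and it is widely used as a benchmark in document layout analysis.
Derived Work
Several lines of research build on DocLayNet's annotations. For example, pretrained models such as LayoutLMv3 have used it as multimodal training data to improve performance on document understanding tasks, and end-to-end segmentation networks such as DocSegTr have used its fine-grained annotations to improve layout segmentation accuracy. The dataset has also prompted domain adaptation methods aimed at specific document categories, such as patents or tender documents.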
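Recent Research on the Dataset
Latest Research Directions
With its large-scale human annotation and diverse, complex layouts, DocLayNet continues to drive work on document layout segmentation. Current research focuses on using its fine-grained class labels and redundant annotations to develop deep learning models that accurately identify elements such as titles, tables, and formulas in documents from different domains, including financial reports and scientific articles. These advances improve the accuracy of automated document understanding and support knowledge mining and digital archiving, with potential applications in sectors such as financial technology and academic publishing.
The content above was collected and summarized by 遇见数据集.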



