
ds4sd/DocLayNet-v1.1

Source: Hugging Face · Updated 2023-09-01 · Indexed 2024-03-04
Download link:
https://hf-mirror.com/datasets/ds4sd/DocLayNet-v1.1
Resource description:
---
annotations_creators:
- crowdsourced
license: other
pretty_name: DocLayNet
size_categories:
- 10K<n<100K
tags:
- layout-segmentation
- COCO
- document-understanding
- PDF
task_categories:
- object-detection
- image-segmentation
task_ids:
- instance-segmentation
dataset_info:
  features:
  - name: image
    dtype: image
  - name: bboxes
    sequence:
      sequence: float64
  - name: category_id
    sequence: int64
  - name: segmentation
    sequence:
      sequence:
        sequence: float64
  - name: area
    sequence: float64
  - name: pdf_cells
    list:
      list:
      - name: bbox
        sequence: float64
      - name: font
        struct:
        - name: color
          sequence: int64
        - name: name
          dtype: string
        - name: size
          dtype: float64
      - name: text
        dtype: string
  - name: metadata
    struct:
    - name: coco_height
      dtype: int64
    - name: coco_width
      dtype: int64
    - name: collection
      dtype: string
    - name: doc_category
      dtype: string
    - name: image_id
      dtype: int64
    - name: num_pages
      dtype: int64
    - name: original_filename
      dtype: string
    - name: original_height
      dtype: float64
    - name: original_width
      dtype: float64
    - name: page_hash
      dtype: string
    - name: page_no
      dtype: int64
  splits:
  - name: train
    num_bytes: 28172005254.125
    num_examples: 69375
  - name: test
    num_bytes: 1996179229.125
    num_examples: 4999
  - name: val
    num_bytes: 2493896901.875
    num_examples: 6489
  download_size: 7766115331
  dataset_size: 32662081385.125
---

# Dataset Card for DocLayNet v1.1

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043

### Dataset Summary

DocLayNet provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold standard in layout segmentation through human recognition and interpretation of each page layout.
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals.
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models.
5. *Pre-defined train, test, and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and to avoid leakage of unique layout styles across the sets.

## Dataset Structure

This dataset is structured differently from the other repository [ds4sd/DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet), as this one includes the content (PDF cells) of the detections and abandons the COCO format.

* `image`: page PIL image.
* `bboxes`: a list of layout bounding boxes.
* `category_id`: a list of class ids corresponding to the bounding boxes.
* `segmentation`: a list of layout segmentation polygons.
* `pdf_cells`: a list of lists corresponding to `bboxes`. Each list contains the PDF cells (content) inside the bounding box.
* `metadata`: page and document metadata.
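
The field layout above can be explored with a short Python sketch. This is a minimal illustration, not an official loader: `stream_split`, `check_alignment`, `box_texts`, and `toy_sample` are hypothetical names introduced here, the toy values are invented, and the streaming call assumes the Hugging Face `datasets` package is installed and the hub is reachable. The helpers rely only on what the card states, namely that `category_id` and `pdf_cells` hold one entry per bounding box.

```python
from typing import Any, Dict, Iterator, List

DATASET_ID = "ds4sd/DocLayNet-v1.1"


def stream_split(split: str = "val") -> Iterator[Dict[str, Any]]:
    """Iterate one split lazily (needs the `datasets` package and network access)."""
    # Imported lazily so the pure helpers below work without `datasets` installed.
    from datasets import load_dataset

    return iter(load_dataset(DATASET_ID, split=split, streaming=True))


def check_alignment(sample: Dict[str, Any]) -> int:
    """Verify the per-box lists have equal length and return the box count."""
    n_boxes = len(sample["bboxes"])
    for key in ("category_id", "pdf_cells"):
        if len(sample[key]) != n_boxes:
            raise ValueError(f"{key} has {len(sample[key])} entries, expected {n_boxes}")
    return n_boxes


def box_texts(sample: Dict[str, Any]) -> List[str]:
    """Return one string per layout box: the texts of its PDF cells, in stored order."""
    return [" ".join(cell["text"] for cell in cells) for cells in sample["pdf_cells"]]


# Hand-made sample for illustration only; coordinates, fonts, and text are invented.
_FONT = {"color": [0, 0, 0, 255], "name": "Example-Font", "size": 10.0}
toy_sample = {
    "bboxes": [[10.0, 20.0, 200.0, 30.0], [10.0, 60.0, 200.0, 120.0]],
    "category_id": [11, 10],
    "pdf_cells": [
        [{"bbox": [10.0, 20.0, 200.0, 30.0], "font": _FONT, "text": "DocLayNet"}],
        [
            {"bbox": [10.0, 60.0, 90.0, 70.0], "font": _FONT, "text": "Page layout"},
            {"bbox": [95.0, 60.0, 200.0, 70.0], "font": _FONT, "text": "ground truth"},
        ],
    ],
}

if __name__ == "__main__":
    print(check_alignment(toy_sample), box_texts(toy_sample))
```

To try it on real data, something like `sample = next(stream_split("val"))` followed by `box_texts(sample)` should work under the assumptions above; the sketch does not interpret the coordinate convention of `bboxes`, since the card does not state it.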

Bounding box classes / categories:

```
1: Caption
2: Footnote
3: Formula
4: List-item
5: Page-footer
6: Page-header
7: Picture
8: Section-header
9: Table
10: Text
11: Title
```

The `["metadata"]["doc_category"]` field uses one of the following constants:

```
* financial_reports,
* scientific_articles,
* laws_and_regulations,
* government_tenders,
* manuals,
* patents
```

### Data Splits

The dataset provides three splits:
- `train`
- `val`
- `test`

## Dataset Creation

### Annotations

#### Annotation process

The labeling guide used to train the annotation experts is available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.

## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research. You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```
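Provider:
ds4sd
Original information summary

Dataset overview

Basic dataset information

  • Name: DocLayNet
  • License: other (the dataset card names CDLA-Permissive-1.0)
  • Tags: layout-segmentation, COCO, document-understanding, PDF
  • Task categories: object-detection, image-segmentation
  • Task IDs: instance-segmentation
  • Size category: 10K<n<100K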

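Dataset structure

Data fields

  • image: page image
  • bboxes: list of bounding boxes
  • category_id: list of class IDs
  • segmentation: list of segmentation polygons
  • area: list of areas
  • pdf_cells: list of PDF cells
    • bbox: bounding box
    • font: font information
      • color: color
      • name: font name
      • size: font size
    • text: text content
  • metadata: metadata
    • coco_height: COCO height
    • coco_width: COCO width
    • collection: collection
    • doc_category: document category
    • image_id: image ID
    • num_pages: number of pages
    • original_filename: original file name
    • original_height: original height
    • original_width: original width
    • page_hash: page hash
    • page_no: page number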
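
Since `category_id` stores integers, a lookup table is needed to read it. The following sketch copies the ID-to-name mapping from the class list in the dataset card and tallies the labels on one page; `CATEGORY_NAMES` and `count_labels` are hypothetical names introduced for illustration, and the sample IDs are invented.

```python
from collections import Counter
from typing import Iterable

# Mapping copied from the class list in the dataset card (IDs 1 to 11).
CATEGORY_NAMES = {
    1: "Caption",
    2: "Footnote",
    3: "Formula",
    4: "List-item",
    5: "Page-footer",
    6: "Page-header",
    7: "Picture",
    8: "Section-header",
    9: "Table",
    10: "Text",
    11: "Title",
}


def count_labels(category_ids: Iterable[int]) -> Counter:
    """Count how often each class name occurs among the given class IDs."""
    # An ID outside 1..11 raises KeyError, which surfaces unexpected data early.
    return Counter(CATEGORY_NAMES[cid] for cid in category_ids)


if __name__ == "__main__":
    # Invented IDs standing in for one page's `category_id` list.
    print(count_labels([11, 10, 10, 9, 5]))
```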

Data splits

  • train: 69375 examples, 28172005254.125 bytes
  • test: 4999 examples, 1996179229.125 bytes
  • val: 6489 examples, 2493896901.875 bytes
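
As a quick consistency check, the three split sizes can be summed and compared against the 80863 unique pages quoted in the dataset summary. The sketch below uses only the example counts listed above; `SPLITS`, `total`, and `shares` are names chosen for illustration.

```python
# Example counts per split, as listed in the dataset metadata.
SPLITS = {"train": 69375, "test": 4999, "val": 6489}

# Total number of pages across all splits.
total = sum(SPLITS.values())

# Share of each split as a percentage of the total, rounded to one decimal.
shares = {name: round(100 * count / total, 1) for name, count in SPLITS.items()}

if __name__ == "__main__":
    print(total)   # 80863, matching the page count in the dataset summary
    print(shares)  # {'train': 85.8, 'test': 6.2, 'val': 8.0}
```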


Collected summary
Dataset introduction
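Construction
High-quality annotated data is key to improving models for document layout analysis. DocLayNet v1.1 was annotated by trained human experts working through crowdsourcing, with the aim of precise and consistent labels. It draws 80863 unique pages from six categories of public documents (financial reports, scientific articles, laws and regulations, government tenders, manuals, and patents), covering a wide range of layouts. Annotators followed a detailed labeling guide, and a fraction of the pages were double- or triple-annotated, which allows annotation uncertainty to be quantified and gives an upper bound on the accuracy a machine learning model can be expected to reach.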
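Features
The core strength of DocLayNet v1.1 is a hand-annotated gold standard for layout segmentation with 11 fine-grained class labels, such as Title, Text, Table, Picture, and Formula, that distinguish the layout elements of a page in detail. Besides bounding boxes and instance segmentation polygons, this version also includes the PDF cell content, linking the visual layout to the underlying text. Its predefined training, validation, and test sets keep the class labels proportionally represented and prevent unique layout styles from leaking across sets, which gives a fair basis for model evaluation.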
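Usage
The dataset mainly serves computer vision tasks such as document layout segmentation and object detection. It can be loaded directly from Hugging Face and contains the fields image, bounding boxes, class IDs, segmentation polygons, PDF cells, and metadata. Note that this version has abandoned the COCO format in favor of these per-page fields. The document category stored in `metadata` can be used to select subsets for targeted analysis or training. The data is already divided into training, validation, and test splits, so it can be used directly for model training, hyperparameter tuning, and performance testing.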
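
Selecting a subset by document category can be sketched as follows. This is an illustrative helper, not part of any library: `filter_by_category` and `DOC_CATEGORIES` are hypothetical names, the category constants are the six listed in the dataset card, and the toy pages are invented. It works on any iterable of sample dictionaries that carry `metadata["doc_category"]`.

```python
from typing import Any, Dict, Iterable, Iterator

# The six `doc_category` constants listed in the dataset card.
DOC_CATEGORIES = frozenset({
    "financial_reports",
    "scientific_articles",
    "laws_and_regulations",
    "government_tenders",
    "manuals",
    "patents",
})


def filter_by_category(
    samples: Iterable[Dict[str, Any]], category: str
) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the samples whose metadata matches `category`."""
    # Validate eagerly so a misspelled category fails before any iteration.
    if category not in DOC_CATEGORIES:
        raise ValueError(f"unknown doc_category: {category!r}")
    return (s for s in samples if s["metadata"]["doc_category"] == category)


if __name__ == "__main__":
    # Invented pages that carry only the metadata needed for the example.
    pages = [
        {"metadata": {"doc_category": "patents", "page_no": 1}},
        {"metadata": {"doc_category": "manuals", "page_no": 4}},
        {"metadata": {"doc_category": "patents", "page_no": 2}},
    ]
    for page in filter_by_category(pages, "patents"):
        print(page["metadata"]["page_no"])
```

Because the helper returns a generator, it can wrap a streamed split without holding all pages in memory; a dataset object's own filtering method may be a better fit when the data is already loaded.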
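Background and challenges
Background
In document intelligence, layout segmentation is a key technique for understanding complex document structure: it identifies and localizes the visual elements on a page. DocLayNet was created in 2022 by the Deep Search team at IBM Research to provide benchmark data for multi-class document layout segmentation through high-quality human annotation. It covers six document categories, including financial reports, scientific articles, and laws and regulations, contains 80863 pages, and defines 11 fine-grained label classes such as Title, Table, and Formula. Redundant annotations and fixed data splits make training and evaluation on it more reliable.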
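Current challenges
DocLayNet targets the central difficulty of layout segmentation: accurately parsing pages whose structure is highly varied and complex, for example handling text nested with tables or telling page headers from page footers. Building the dataset raised several difficulties of its own. First, annotation relies on trained annotators interpreting each page by hand and agreeing on complex layouts, which is costly in time and labor. Second, the documents come from many sources with very different layout styles, so the labeling guide must generalize well enough to cover edge cases. Third, the distribution of document categories and labels has to be balanced so that data bias does not hurt model generalization.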
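Common scenarios
Typical use
DocLayNet serves as a standardized evaluation benchmark for layout segmentation. Its hand-annotated bounding boxes and segmentation polygons mark 11 classes of layout elements on each page, such as titles, tables, and pictures, and support training and validating instance segmentation and object detection models. Because it spans six document categories, including financial reports, scientific papers, and legal texts, it tests how well models generalize across complex and varied layouts.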
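Derived work
According to this summary, several lines of work build on DocLayNet's annotations: pretrained models such as LayoutLMv3 reportedly use it as multimodal training data, and end-to-end segmentation networks such as DocSegTr reportedly use its fine-grained labels to improve layout segmentation accuracy. The summary also mentions domain adaptation methods aimed at specific document categories, such as patents or tender documents.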
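Recent research
Latest directions
Current research uses DocLayNet's large-scale human annotation, fine-grained class labels, and redundant annotations to develop deep learning models that identify elements such as titles, tables, and formulas across document domains like financial reports and scientific papers. The summary names knowledge mining and digital archiving as downstream uses, with applications in financial technology and academic publishing.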
The above content was collected and summarized by 遇见数据集.