pierreguillou/DocLayNet-base

Name: pierreguillou/DocLayNet-base
Creator: pierreguillou
Published: 2023-05-17 08:56:30
License: no description provided

Hugging Face; updated 2023-05-17; indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/pierreguillou/DocLayNet-base

Resource description:

---
language:
- en
- de
- fr
- ja
annotations_creators:
- crowdsourced
license: other
pretty_name: DocLayNet base
size_categories:
- 1K<n<10K
tags:
- DocLayNet
- COCO
- PDF
- IBM
- Financial-Reports
- Finance
- Manuals
- Scientific-Articles
- Science
- Laws
- Law
- Regulations
- Patents
- Government-Tenders
- object-detection
- image-segmentation
- token-classification
task_categories:
- object-detection
- image-segmentation
- token-classification
task_ids:
- instance-segmentation
---

# Dataset Card for DocLayNet base

## About this card (01/27/2023)

### Property and license

All information on this page, except the content of this section "About this card (01/27/2023)", has been copied from the [Dataset Card for DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet).

DocLayNet is a dataset created by Deep Search (IBM Research) and published under the [license CDLA-Permissive-1.0](https://huggingface.co/datasets/ds4sd/DocLayNet#licensing-information).

I do not claim any rights to the data taken from this dataset and published on this page.

### DocLayNet dataset

The [DocLayNet dataset](https://github.com/DS4SD/DocLayNet) (IBM) provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories.
To date, the dataset can be downloaded through direct links or as a dataset from Hugging Face datasets:
- direct links: [doclaynet_core.zip](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_core.zip) (28 GiB), [doclaynet_extra.zip](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip) (7.5 GiB)
- Hugging Face dataset library: [dataset DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet)

Paper: [DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis](https://arxiv.org/abs/2206.01062) (06/02/2022)

### Processing into a format facilitating its use by HF notebooks

These 2 options require downloading all the data (approximately 30 GiB), which takes time (about 45 min in Google Colab) and a large amount of disk space. This could limit experimentation for people with few resources.

Moreover, even when downloading via the HF datasets library, the EXTRA zip must be downloaded separately ([doclaynet_extra.zip](https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip), 7.5 GiB) to associate the annotated bounding boxes with the text extracted by OCR from the PDFs. This operation also requires additional code, because the bounding boxes of the texts do not necessarily correspond to the annotated ones (computing the percentage of area shared between the annotated bounding boxes and those of the texts makes it possible to compare them).

Finally, in order to use Hugging Face notebooks for fine-tuning layout models like LayoutLMv3 or LiLT, DocLayNet data must be processed into a suitable format.

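
The area-overlap comparison mentioned above can be sketched as follows. This is only an illustration and not the code used to build this dataset: the corner box format `(x0, y0, x1, y1)`, the function names and the `0.5` threshold are all assumptions.

```python
from typing import List, Optional, Sequence

Box = Sequence[float]  # assumed format: (x0, y0, x1, y1) in pixels


def box_area(box: Box) -> float:
    """Area of a box; inverted or degenerate boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def overlap_fraction(text_box: Box, annotated_box: Box) -> float:
    """Fraction of the text box's area that lies inside the annotated box."""
    intersection = (
        max(text_box[0], annotated_box[0]),
        max(text_box[1], annotated_box[1]),
        min(text_box[2], annotated_box[2]),
        min(text_box[3], annotated_box[3]),
    )
    area = box_area(text_box)
    return box_area(intersection) / area if area > 0 else 0.0


def assign_text_cells(
    text_boxes: Sequence[Box],
    annotated_boxes: Sequence[Box],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the source
) -> List[Optional[int]]:
    """For each text cell, return the index of the annotated box that covers
    the largest share of it, or None if no box reaches the threshold."""
    assignments: List[Optional[int]] = []
    for text_box in text_boxes:
        best_index, best_fraction = None, threshold
        for index, annotated_box in enumerate(annotated_boxes):
            fraction = overlap_fraction(text_box, annotated_box)
            if fraction >= best_fraction:
                best_index, best_fraction = index, fraction
        assignments.append(best_index)
    return assignments
```

Normalizing by the text cell's own area, rather than by the union as in IoU, is a design choice for this sketch: a small text cell fully contained in a large annotated block then scores 1.0.
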
For all these reasons, I decided to process the DocLayNet dataset:
- into 3 datasets of different sizes:
  - [DocLayNet small](https://huggingface.co/datasets/pierreguillou/DocLayNet-small) (about 1% of DocLayNet) < 1,000 document images (691 train, 64 val, 49 test)
  - [DocLayNet base](https://huggingface.co/datasets/pierreguillou/DocLayNet-base) (about 10% of DocLayNet) < 10,000 document images (6,910 train, 648 val, 499 test)
  - [DocLayNet large](https://huggingface.co/datasets/pierreguillou/DocLayNet-large) (about 100% of DocLayNet) < 100,000 document images (69,103 train, 6,480 val, 4,994 test)
- with associated texts and PDFs (base64 format),
- and in a format facilitating their use by HF notebooks.

*Note: the layout HF notebooks will greatly help participants of the IBM [ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents](https://ds4sd.github.io/icdar23-doclaynet/)!*

### About PDF languages

Quoted from page 3 of the [DocLayNet paper](https://arxiv.org/abs/2206.01062):
"We did not control the document selection with regard to language. **The vast majority of documents contained in DocLayNet (close to 95%) are published in English language.** However, **DocLayNet also contains a number of documents in other languages such as German (2.5%), French (1.0%) and Japanese (1.0%)**. While the document language has negligible impact on the performance of computer vision methods such as object detection and segmentation models, it might prove challenging for layout analysis methods which exploit textual features."

### About PDF category distribution

Quoted from page 3 of the [DocLayNet paper](https://arxiv.org/abs/2206.01062):
"The pages in DocLayNet can be grouped into **six distinct categories**, namely **Financial Reports, Manuals, Scientific Articles, Laws & Regulations, Patents and Government Tenders**. Each document category was sourced from various repositories.
For example, Financial Reports contain both free-style format annual reports which expose company-specific, artistic layouts as well as the more formal SEC filings. The two largest categories (Financial Reports and Manuals) contain a large amount of free-style layouts in order to obtain maximum variability. In the other four categories, we boosted the variability by mixing documents from independent providers, such as different government websites or publishers. In Figure 2, we show the document categories contained in DocLayNet with their respective sizes."

![DocLayNet PDFs categories distribution (source: DocLayNet paper)](https://huggingface.co/datasets/pierreguillou/DocLayNet-base/resolve/main/DocLayNet_PDFs_categories_distribution.png)

### Download & overview

The size of DocLayNet base is about 10% of the DocLayNet dataset (random selection in the train, val and test files, respectively).

```
# !pip install -q datasets

from datasets import load_dataset

dataset_base = load_dataset("pierreguillou/DocLayNet-base")

# overview of dataset_base
#
# DatasetDict({
#     train: Dataset({
#         features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
#         num_rows: 6910
#     })
#     validation: Dataset({
#         features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
#         num_rows: 648
#     })
#     test: Dataset({
#         features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
#         num_rows: 499
#     })
# })
```

### Annotated bounding boxes

DocLayNet base makes it easy to display a document image with the annotated bounding boxes of paragraphs or lines.

Check the notebook [processing_DocLayNet_dataset_to_be_used_by_layout_models_of_HF_hub.ipynb](https://github.com/piegu/language-models/blob/master/processing_DocLayNet_dataset_to_be_used_by_layout_models_of_HF_hub.ipynb) to get the code.

#### Paragraphs

![Annotated DocLayNet document image with bounding boxes and categories of paragraphs](https://huggingface.co/datasets/pierreguillou/DocLayNet-base/resolve/main/DocLayNet_image_annotated_bounding_boxes_paragraph.png)

#### Lines

![Annotated DocLayNet document image with bounding boxes and categories of lines](https://huggingface.co/datasets/pierreguillou/DocLayNet-base/resolve/main/DocLayNet_image_annotated_bounding_boxes_line.png)

### HF notebooks

- [notebooks LayoutLM](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLM) (Niels Rogge)
- [notebooks LayoutLMv2](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv2) (Niels Rogge)
- [notebooks LayoutLMv3](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3) (Niels Rogge)
- [notebooks LiLT](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LiLT) (Niels Rogge)
- [Document AI: Fine-tuning LiLT for document-understanding using Hugging Face Transformers](https://github.com/philschmid/document-ai-transformers/blob/main/training/lilt_funsd.ipynb) ([post](https://www.philschmid.de/fine-tuning-lilt#3-fine-tune-and-evaluate-lilt) of Phil Schmid)

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold standard in layout segmentation through human recognition and interpretation of each page layout.
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals.
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models.
5. *Pre-defined train, test and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and avoid leakage of unique layout styles across the sets.

### Supported Tasks and Leaderboards

We are hosting a competition in ICDAR 2023 based on the DocLayNet dataset. For more information see https://ds4sd.github.io/icdar23-doclaynet/.

## Dataset Structure

### Data Fields

DocLayNet provides four types of data assets:

1. PNG images of all pages, resized to square `1025 x 1025px`
2. Bounding-box annotations in COCO format for each PNG image
3. Extra: Single-page PDF files matching each PNG image
4. Extra: JSON file matching each PDF page, which provides the digital text cells with coordinates and content

COCO image records are defined as in this example:

```js
...
{
  "id": 1,
  "width": 1025,
  "height": 1025,
  "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",

  // Custom fields:
  "doc_category": "financial_reports", // high-level document category
  "collection": "ann_reports_00_04_fancy", // sub-collection name
  "doc_name": "NASDAQ_FFIN_2002.pdf", // original document filename
  "page_no": 9, // page number in original document
  "precedence": 0, // Annotation order, non-zero in case of redundant double- or triple-annotation
},
...
```

The `doc_category` field uses one of the following constants:

```
financial_reports,
scientific_articles,
laws_and_regulations,
government_tenders,
manuals,
patents
```

### Data Splits

The dataset provides three splits:
- `train`
- `val`
- `test`

## Dataset Creation

### Annotations

#### Annotation process

The labeling guideline used for training the annotation experts is available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.

## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```

### Contributions

Thanks to [@dolfim-ibm](https://github.com/dolfim-ibm), [@cau-git](https://github.com/cau-git) for adding this dataset.

Provider:

pierreguillou

Summary of original information

Dataset overview

Dataset name

Name: DocLayNet base
Alias: DocLayNet

Dataset attributes

Languages: English (about 95%), German (2.5%), French (1.0%), Japanese (1.0%)
License: CDLA-Permissive-1.0
Size category: 1K<n<10K
Tags: DocLayNet, COCO, PDF, IBM, Financial-Reports, Finance, Manuals, Scientific-Articles, Science, Laws, Law, Regulations, Patents, Government-Tenders, object-detection, image-segmentation, token-classification
Task categories: object-detection, image-segmentation, token-classification
Task IDs: instance-segmentation

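Dataset content

Description: DocLayNet provides page-by-page layout segmentation ground truth, using bounding boxes for 11 distinct class labels on 80,863 unique pages from 6 document categories.
Document categories: Financial Reports, Manuals, Scientific Articles, Laws & Regulations, Patents, Government Tenders
Data download:
- Direct links: doclaynet_core.zip (28 GiB), doclaynet_extra.zip (7.5 GiB)
- Hugging Face dataset library: dataset DocLayNet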

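Dataset processing

Purpose: to be usable with Hugging Face notebooks, DocLayNet data must be processed into a suitable format.
Result:
- Three datasets of different sizes:
  - DocLayNet small (< 1,000 document images)
  - DocLayNet base (< 10,000 document images)
  - DocLayNet large (< 100,000 document images)
- With associated texts and PDFs (base64 format)
- In a format that facilitates use by HF notebooks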

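Dataset usage

Use case: the IBM ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.
Example code: fine-tuning layout models such as LayoutLMv3 or LiLT with Hugging Face notebooks.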

Dataset details

Data fields: id, texts, bboxes_block, bboxes_line, categories, image, pdf, page_hash, original_filename, page_no, num_pages, original_width, original_height, coco_width, coco_height, collection, doc_category
Data splits: train, validation, test

Dataset creation

Annotation creators: crowdsourced
Annotation process: DocLayNet_Labeling_Guide_Public.pdf was used as the guideline for training the annotation experts.

Additional information

Dataset curators: Deep Search team, IBM Research
Contact: deepsearch-core@zurich.ibm.com
Contributors: @dolfim-ibm, @cau-git

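Collected summary

Dataset introduction

Construction

In document layout analysis, high-quality annotated datasets are key to improving model performance. DocLayNet-base is derived from the original DocLayNet dataset created by the IBM Deep Search team and was built through a systematic processing pipeline. It samples about 10% of the original 80,863 pages, giving a subset of 6,910 training pages, 648 validation pages and 499 test pages. The processing keeps the original bounding-box annotations made by human experts and adds the text extracted by OCR and the base64-encoded PDF of each page, so that every sample is complete across modalities. All pages are provided as 1025×1025-pixel PNG images with annotations organized in COCO format, covering bounding-box coordinates and category labels at both paragraph and line level. This construction reduces the storage and loading cost of the data while preserving the layout diversity and annotation accuracy of the original dataset.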
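
The split sizes quoted above follow from taking 10% of each DocLayNet split and rounding down. The sketch below reproduces that arithmetic and shows one way to draw a reproducible random subset per split. The source only says the selection was random, so the seed, the rounding rule and the function names here are assumptions and not the author's procedure.

```python
import random
from typing import Dict, List

# Page counts of the full splits, as listed on the card for DocLayNet large.
FULL_SPLIT_SIZES: Dict[str, int] = {"train": 69103, "validation": 6480, "test": 4994}


def subset_size(n_pages: int, percent: int) -> int:
    """Number of pages kept when taking `percent`% of a split, rounded down."""
    return n_pages * percent // 100


def sample_indices(n_pages: int, percent: int, seed: int = 0) -> List[int]:
    """Pick a reproducible random subset of page indices (seed is hypothetical)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_pages), subset_size(n_pages, percent)))


base_sizes = {name: subset_size(n, 10) for name, n in FULL_SPLIT_SIZES.items()}
small_sizes = {name: subset_size(n, 1) for name, n in FULL_SPLIT_SIZES.items()}
print(base_sizes)   # {'train': 6910, 'validation': 648, 'test': 499}
print(small_sizes)  # {'train': 691, 'validation': 64, 'test': 49}
```

Sampling each split separately, as the card describes, keeps the original train, validation and test boundaries intact, so no page moves between splits.
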

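Features

DocLayNet-base has several notable features. Its core strength is the high-quality labels produced by human experts, which provide a reliable gold standard for document layout segmentation. The dataset covers six document categories (financial reports, scientific articles, laws and regulations, government tenders, manuals and patents) drawn from a variety of public sources, which ensures broad coverage of layout styles. It defines 11 fine-grained class labels that distinguish document elements such as titles, paragraphs, lists and tables. Some pages are double- or triple-annotated, which provides a basis for estimating annotation consistency and the upper bound of model performance. The dataset also comes with predefined train, validation and test splits in which class labels are proportionally represented and leakage of layout styles across splits is avoided. Finally, the alignment of text with images and PDFs extends its potential for joint vision-language modeling.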

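Usage

The dataset is designed for document layout analysis and understanding, and can be used for machine learning tasks such as object detection, instance segmentation and token classification. Users can load it directly with the Hugging Face datasets library and use its preprocessed, uniform format to integrate it quickly into a training pipeline. Each sample contains several feature fields, including image, texts, bounding boxes, categories and PDF, which makes end-to-end model training straightforward. Researchers can fine-tune document understanding models such as LayoutLMv3 or LiLT on this dataset, or take part in the related ICDAR 2023 competition to evaluate their algorithms. The standardized format and rich annotations lower the barrier to experimentation, so that researchers with limited resources can also work on document layout segmentation and content extraction.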
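
Models in the LayoutLM family expect bounding boxes as integers on a 0 to 1000 scale, so one preprocessing step before fine-tuning is to rescale pixel boxes by the page size. The sketch below illustrates that step on a hand-written sample. The field names `bboxes_line`, `coco_width` and `coco_height` come from the dataset's feature list, but the COCO-style `[x, y, width, height]` box layout is an assumption that should be checked against the real data, and the helper names are hypothetical.

```python
from typing import Dict, List, Sequence


def coco_to_corners(box: Sequence[float]) -> List[float]:
    """Convert an assumed [x, y, width, height] box to [x0, y0, x1, y1]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def normalize_box(
    box: Sequence[float], page_width: float, page_height: float, scale: int = 1000
) -> List[int]:
    """Rescale a pixel-space [x0, y0, x1, y1] box to integers in 0..scale."""
    x0, y0, x1, y1 = box

    def to_scale(value: float, size: float) -> int:
        # Clamp so that boxes slightly outside the page stay in range.
        return max(0, min(scale, int(scale * value / size)))

    return [
        to_scale(x0, page_width),
        to_scale(y0, page_height),
        to_scale(x1, page_width),
        to_scale(y1, page_height),
    ]


def prepare_line_boxes(example: Dict) -> List[List[int]]:
    """Normalize every line-level box of one sample (box layout is assumed)."""
    width, height = example["coco_width"], example["coco_height"]
    return [
        normalize_box(coco_to_corners(box), width, height)
        for box in example["bboxes_line"]
    ]


# Hand-written sample for illustration only; not real dataset content.
sample = {
    "coco_width": 1025,
    "coco_height": 1025,
    "bboxes_line": [[102.5, 205, 410, 51.25], [0, 0, 1025, 1025]],
}
normalized = prepare_line_boxes(sample)
```

With the real dataset, a function like `prepare_line_boxes` could be applied per sample through `Dataset.map`, once the box layout has been verified.
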

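Background and challenges

Background

Document layout analysis, a key task in document intelligence, aims to identify and understand the geometric position and semantic category of the visual elements in a document. In 2022, the Deep Search team at IBM Research released the DocLayNet dataset, whose core research question is how to provide a high-quality, human-annotated benchmark for complex and diverse document layouts. The dataset contains more than eighty thousand pages from six categories, including financial reports, scientific articles, and laws and regulations, and defines 11 fine-grained layout label classes. Through expert human annotation and redundant annotation, DocLayNet improves the reliability and interpretability of layout segmentation and provides a solid data foundation for training and evaluating document understanding models, moving the field toward greater precision and robustness.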

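Current challenges

In document layout analysis, models must cope with extreme diversity in page design: free-form, artistic layouts in financial reports coexist with rigid tabular structures in legal texts, which demands strong generalization. In building DocLayNet, the main challenge was obtaining high-quality annotations. On the one hand, recruiting and training professional annotators to recognize and classify complex layout elements (such as figures, formulas and multi-column text) accurately took considerable effort and time. On the other hand, keeping the dataset balanced and representative required collecting documents systematically from several heterogeneous sources and designing fixed splits that prevent leakage of layout styles, all of which involved careful engineering and quality control.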

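Common scenarios

Classic use cases

In document layout analysis, DocLayNet-base serves as a benchmark resource for training and evaluating visual document understanding models, thanks to its carefully annotated bounding boxes and varied document categories. The dataset covers six kinds of documents, including financial reports, scientific articles, and laws and regulations, and its 11 layout labels annotated by human experts give models a standard for recognizing page elements such as paragraphs, tables and figures. Researchers commonly use it for object detection and instance segmentation, to improve how models parse complex document structures.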

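Derived and related work

DocLayNet-base has given rise to a body of related research, particularly on improving Transformer-based document understanding models. For example, models such as LayoutLMv3 and LiLT are fine-tuned on this dataset to improve multimodal document representations. The dataset also supported the ICDAR 2023 document layout segmentation competition, which encouraged innovation in and comparison of layout analysis algorithms.

Recent research on the dataset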