ds4sd/icdar2023-doclaynet

Name: ds4sd/icdar2023-doclaynet
Creator: ds4sd
Published: 2023-02-01 06:39:27
License: No description available

Source: Hugging Face. Updated 2023-02-01. Indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/ds4sd/icdar2023-doclaynet


Resource description:

---
annotations_creators:
- crowdsourced
license: apache-2.0
pretty_name: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents
size_categories:
- n<1K
tags:
- layout-segmentation
- COCO
- document-understanding
- PDF
- icdar
- competition
task_categories:
- object-detection
- image-segmentation
task_ids:
- instance-segmentation
---

# Dataset Card for ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://ds4sd.github.io/icdar23-doclaynet/
- **Leaderboard:** https://eval.ai/web/challenges/challenge-page/1923/leaderboard
- **Point of Contact:**

### Dataset Summary

This is the official competition dataset for the _ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents_. You are invited to advance research on accurately segmenting the layout of a broad range of document styles and domains. To achieve this, we challenge you to develop a model that correctly identifies and segments the layout components in document pages as bounding boxes, using a competition dataset we provide.

For more information see https://ds4sd.github.io/icdar23-doclaynet/.

#### Training resources

In our recently published [DocLayNet](https://github.com/DS4SD/DocLayNet) dataset, which contains 80k+ human-annotated document pages exposing diverse layouts, we define 11 classes for layout components (paragraphs, headings, tables, figures, lists, mathematical formulas and several more). We encourage you to use this dataset for training and internal evaluation of your solution. You may also consider any other publicly available document layout dataset for training (e.g. [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet), [DocBank](https://github.com/doc-analysis/DocBank)).

### Supported Tasks and Leaderboards

This is the official dataset of the ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents. For more information see https://ds4sd.github.io/icdar23-doclaynet/.

#### Evaluation Metric

Your submissions on our [EvalAI challenge](https://eval.ai/web/challenges/challenge-page/1923/) will be evaluated using the Mean Average Precision (mAP) @ Intersection-over-Union (IoU) [0.50:0.95] metric, as used in the [COCO](https://cocodataset.org/) object detection competition. In detail, we will calculate the average precision for a sequence of IoU thresholds ranging from 0.50 to 0.95 with a step size of 0.05. This metric is computed for every document category in the competition dataset. The mean of the average precisions over all categories is then computed as the final score.

#### Submission

We ask you to upload a JSON file in [COCO results format](https://cocodataset.org/#format-results) [here](https://eval.ai/web/challenges/challenge-page/1923/submission), with complete layout bounding boxes for each page sample. The given `image_id`s must correspond to the ones we publish with the competition dataset's `coco.json`. For each submission you make, the computed mAP will be provided for each category as well as combined. The [leaderboard](https://eval.ai/web/challenges/challenge-page/1923/leaderboard/4545/Total) will be ranked based on the overall mAP.

## Dataset Structure

### Data Fields

DocLayNet provides four types of data assets:

1. PNG images of all pages, resized to square `1025 x 1025px`
2. ~~Bounding-box annotations in COCO format for each PNG image~~ (annotations will be released at the end of the competition)
3. Extra: Single-page PDF files matching each PNG image
4. Extra: JSON file matching each PDF page, which provides the digital text cells with coordinates and content

The COCO image records are defined as in this example:

```js
    ...
    {
      "id": 1,
      "width": 1025,
      "height": 1025,
      "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",

      // Custom fields:
      "doc_category": "financial_reports", // high-level document category
      "collection": "ann_reports_00_04_fancy", // sub-collection name
      "doc_name": "NASDAQ_FFIN_2002.pdf", // original document filename
      "page_no": 9, // page number in original document
      "precedence": 0, // Annotation order, non-zero in case of redundant double- or triple-annotation
    },
    ...
```

The `doc_category` field uses one of the following constants:

```
reports,
manuals,
patents,
others
```

### Data Splits

The dataset provides the following splits:

- `dev`, which is extracted from the [DocLayNet](https://github.com/DS4SD/DocLayNet) dataset
- `test`, which contains new data for the competition

## Dataset Creation

### Annotations

#### Annotation process

The labeling guidelines used to train the annotation experts are available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.

## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research. You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

A publication will be submitted at the end of the competition. Meanwhile, we suggest citing our original dataset paper.

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```

### Contributions

Thanks to [@dolfim-ibm](https://github.com/dolfim-ibm), [@cau-git](https://github.com/cau-git) for adding this dataset.
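The Submission section of the card asks for a JSON file in COCO results format. As a rough illustration, the sketch below builds such a file from model predictions. The function name `to_coco_results`, the example predictions, and the category IDs are hypothetical. The record layout (`image_id`, `category_id`, `bbox` as `[x, y, width, height]`, `score`) follows the standard COCO detection results format. The input corner-box convention is an assumption about what a detector might emit.

```python
import json

def to_coco_results(predictions):
    """Convert (image_id, category_id, box, score) tuples into COCO results records.

    Boxes arrive as (x_min, y_min, x_max, y_max) in pixels (an assumed detector
    output convention) and are converted to COCO's [x, y, width, height].
    """
    results = []
    for image_id, category_id, (x0, y0, x1, y1), score in predictions:
        results.append({
            "image_id": image_id,        # must match an id from the published coco.json
            "category_id": category_id,  # hypothetical class id, for illustration only
            "bbox": [round(x0, 2), round(y0, 2), round(x1 - x0, 2), round(y1 - y0, 2)],
            "score": round(float(score), 4),
        })
    return results

# Hypothetical predictions for a single page with image_id 1.
example_predictions = [
    (1, 10, (100.0, 200.0, 400.0, 260.0), 0.93),
    (1, 9, (100.0, 300.0, 900.0, 700.0), 0.88),
]

submission = to_coco_results(example_predictions)
submission_json = json.dumps(submission)  # write this string to the file you upload
```

Keeping the conversion in one function makes it easier to check that every `image_id` in the output also appears in the competition's `coco.json` before uploading.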

Provided by:

ds4sd

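Summary of original information

Dataset overview

Dataset name

Name: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents
Alias (Chinese): ICDAR 2023 企业文档鲁棒布局分割竞赛

Dataset description

Purpose: Used for the ICDAR 2023 competition, which aims to advance research on accurately segmenting layouts across a wide range of document styles and domains.
Content: Provides a document layout segmentation dataset for training and evaluation.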

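Dataset structure

Data types: PNG images, PDF files, and JSON files.
Image size: All page images are resized to a square 1025 x 1025px.
Extra resources: A PDF file matching each PNG image, and a JSON file providing the digital text cells with their coordinates.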
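To make the structure above concrete, here is a minimal sketch that reads COCO-style image records and checks them against the stated image size. The inline JSON is a made-up excerpt, not real competition data, and the helper `summarize_images` is hypothetical. It assumes the competition's `coco.json` follows the usual COCO layout with a top-level `images` list and carries the custom `doc_category` field shown in the dataset card.

```python
import json
from collections import Counter

EXPECTED_SIZE = 1025  # pages are described as resized to 1025 x 1025px

# Made-up excerpt standing in for the contents of coco.json.
coco_text = """
{"images": [
  {"id": 1, "width": 1025, "height": 1025, "file_name": "a.png", "doc_category": "reports"},
  {"id": 2, "width": 1025, "height": 1025, "file_name": "b.png", "doc_category": "manuals"},
  {"id": 3, "width": 1025, "height": 1025, "file_name": "c.png", "doc_category": "reports"}
]}
"""

def summarize_images(coco):
    """Count pages per doc_category and collect ids whose size is unexpected."""
    per_category = Counter(img["doc_category"] for img in coco["images"])
    wrong_size = [
        img["id"]
        for img in coco["images"]
        if (img["width"], img["height"]) != (EXPECTED_SIZE, EXPECTED_SIZE)
    ]
    return per_category, wrong_size

per_category, wrong_size = summarize_images(json.loads(coco_text))
```

With a real file you would replace `coco_text` with the file contents. The per-category counts are useful because the competition score is averaged over document categories.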

Dataset creation

Annotation process: Annotation experts were trained using DocLayNet_Labeling_Guide_Public.pdf.
Annotators: Crowdsourced.

Supported tasks and evaluation

Tasks: Object detection, image segmentation.
Evaluation metric: Mean Average Precision (mAP) @ Intersection-over-Union (IoU) [0.50:0.95].
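The [0.50:0.95] range in the metric refers to ten IoU thresholds spaced 0.05 apart. The sketch below shows only the IoU building block and the threshold sequence, not a full mAP implementation, which also needs score-ranked matching and precision-recall integration. The function names are hypothetical, and boxes are assumed to use the COCO `[x, y, width, height]` convention.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Ten thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def thresholds_passed(pred_box, gt_box):
    """Return the IoU thresholds at which the prediction would count as a match."""
    value = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if value >= t]
```

A prediction that overlaps its ground-truth box with IoU 0.72 counts as a match at the five thresholds from 0.50 to 0.70 and as a miss at the stricter ones, which is why tight boxes matter for this metric.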

License

License: Apache-2.0

Citation information

Reference: Pfitzmann, Birgit; Auer, Christoph; Dolfi, Michele; Nassar, Ahmed S; Staar, Peter W J (2022). DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), Washington DC, USA, pp. 3743–3751. Association for Computing Machinery, New York, NY, USA. ISBN 9781450393850. https://doi.org/10.1145/3534678.3539043

Contributors

Contributors: @dolfim-ibm, @cau-git

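Collected summary

Dataset introduction

Background and challenges

Background overview

This is the official dataset of the ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents. It provides diverse document page images and layout annotations for training and evaluating document layout segmentation models. The dataset contains PNG images, PDF files, and JSON files, and evaluation uses the COCO-style mAP@IoU[0.50:0.95] metric.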

The summary above was collected and generated by 遇见数据集.
