icdar2023-doclaynet

Name: icdar2023-doclaynet
Creator: maas
Published: 2025-12-05 16:21:09
License: No description provided

ModelScope community. Updated 2025-12-05; indexed 2025-01-25.

Download link:

https://modelscope.cn/datasets/ds4sd/icdar2023-doclaynet


Resource description:

# Dataset Card for ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://ds4sd.github.io/icdar23-doclaynet/
- **Leaderboard:** https://eval.ai/web/challenges/challenge-page/1923/leaderboard
- **Point of Contact:**

### Dataset Summary

This is the official competition dataset for the _ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents_. You are invited to advance research on accurately segmenting layout across a broad range of document styles and domains. To that end, we challenge you to develop a model that correctly identifies and segments the layout components of document pages as bounding boxes, on a competition dataset we provide.

For more information see https://ds4sd.github.io/icdar23-doclaynet/.

#### Training resources

Our recently published [DocLayNet](https://github.com/DS4SD/DocLayNet) dataset contains 80k+ human-annotated document pages with diverse layouts and defines 11 classes of layout components (paragraphs, headings, tables, figures, lists, mathematical formulas and several more). We encourage you to use this dataset for training and internal evaluation of your solution. You may also use any other publicly available document layout dataset for training (e.g. [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet), [DocBank](https://github.com/doc-analysis/DocBank)).

### Supported Tasks and Leaderboards

This is the official dataset of the ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.

For more information see https://ds4sd.github.io/icdar23-doclaynet/.

#### Evaluation Metric

Your submissions to our [EvalAI challenge](https://eval.ai/web/challenges/challenge-page/1923/) will be evaluated using the Mean Average Precision (mAP) @ Intersection-over-Union (IoU) [0.50:0.95] metric, as used in the [COCO](https://cocodataset.org/) object detection competition. In detail, we calculate the average precision over a sequence of IoU thresholds ranging from 0.50 to 0.95 with a step size of 0.05. This metric is computed for every document category in the competition dataset. The mean of the average precisions over all categories is then taken as the final score.

#### Submission

We ask you to upload a JSON file in [COCO results format](https://cocodataset.org/#format-results) [here](https://eval.ai/web/challenges/challenge-page/1923/submission), with complete layout bounding boxes for each page sample. The given `image_id`s must correspond to the ones we publish in the competition dataset's `coco.json`. For each submission you make, the computed mAP will be reported for each category as well as combined. The [leaderboard](https://eval.ai/web/challenges/challenge-page/1923/leaderboard/4545/Total) is ranked by the overall mAP.

## Dataset Structure

### Data Fields

DocLayNet provides four types of data assets:

1. PNG images of all pages, resized to square `1025 x 1025px`
2. ~~Bounding-box annotations in COCO format for each PNG image~~ (annotations will be released at the end of the competition)
3. Extra: Single-page PDF files matching each PNG image
4. Extra: JSON file matching each PDF page, which provides the digital text cells with coordinates and content

A COCO image record is defined as in this example:

```js
...
{
  "id": 1,
  "width": 1025,
  "height": 1025,
  "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",

  // Custom fields:
  "doc_category": "financial_reports", // high-level document category
  "collection": "ann_reports_00_04_fancy", // sub-collection name
  "doc_name": "NASDAQ_FFIN_2002.pdf", // original document filename
  "page_no": 9, // page number in original document
  "precedence": 0, // Annotation order, non-zero in case of redundant double- or triple-annotation
},
...
```

The `doc_category` field uses one of the following constants:

```
reports,
manuals,
patents,
others
```

### Data Splits

The dataset provides the following splits:

- `dev`, which is extracted from the [DocLayNet](https://github.com/DS4SD/DocLayNet) dataset
- `test`, which contains new data for the competition

## Dataset Creation

### Annotations

#### Annotation process

The labeling guide used to train the annotation experts is available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.

## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

A publication will be submitted at the end of the competition. Meanwhile, we suggest citing our original dataset paper.

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```

### Contributions

Thanks to [@dolfim-ibm](https://github.com/dolfim-ibm), [@cau-git](https://github.com/cau-git) for adding this dataset.
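
### Example: IoU thresholds and a COCO results file (illustrative sketch)

The evaluation and submission rules described in this card can be illustrated with a short Python sketch. It is not the official evaluation code. It only shows the IoU threshold sequence 0.50 to 0.95 in steps of 0.05, the IoU of two boxes in COCO `[x, y, width, height]` form, and how detections might be serialized into the COCO results format. The function names, the `box` key holding corner coordinates `(x0, y0, x1, y1)`, the output file name, and the example `category_id` are assumptions made for illustration. Real `image_id` and category values should be taken from the competition's `coco.json`.

```python
import json


def iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 (step 0.05), as in COCO-style mAP."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def box_iou(a, b):
    """Intersection-over-Union of two boxes given as [x, y, width, height]."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (zero if disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def to_coco_results(predictions):
    """Convert hypothetical model outputs into COCO results records.

    Each prediction is assumed to be a dict with keys `image_id`,
    `category_id`, `box` (corner form x0, y0, x1, y1) and `score`.
    """
    results = []
    for p in predictions:
        x0, y0, x1, y1 = p["box"]
        results.append({
            "image_id": p["image_id"],
            "category_id": p["category_id"],
            # COCO results use [x, y, width, height], not corner coordinates.
            "bbox": [float(x0), float(y0), float(x1 - x0), float(y1 - y0)],
            "score": float(p["score"]),
        })
    return results


if __name__ == "__main__":
    # One made-up detection; the ids here are placeholders, not real data.
    preds = [{"image_id": 1, "category_id": 9,
              "box": (10, 20, 110, 70), "score": 0.9}]
    with open("submission.json", "w") as f:
        json.dump(to_coco_results(preds), f)
    print(iou_thresholds())
```

A detection counts as a match at a given threshold only if its IoU with a ground-truth box of the same class reaches that threshold, so the same prediction can count as correct at 0.50 and as a miss at 0.95.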


Provider:

maas

Created:

2025-01-20
