DocLayNet

Name: DocLayNet
Creator: maas
Published: 2026-01-06 16:20:36
License: no description provided

ModelScope community: updated 2026-01-06, indexed 2025-01-25

Download link:

https://modelscope.cn/datasets/swift/DocLayNet


Resource description:

# Dataset Card for DocLayNet

## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

DocLayNet provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold standard in layout segmentation through human recognition and interpretation of each page layout.
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals.
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models.
5. *Pre-defined train, test and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and avoid leakage of unique layout styles across the sets.

### Supported Tasks and Leaderboards

We are hosting a competition in ICDAR 2023 based on the DocLayNet dataset. For more information see https://ds4sd.github.io/icdar23-doclaynet/.

## Dataset Structure

### Data Fields

DocLayNet provides four types of data assets:

1. PNG images of all pages, resized to square `1025 x 1025px`
2. Bounding-box annotations in COCO format for each PNG image
3. Extra: Single-page PDF files matching each PNG image
4. Extra: JSON file matching each PDF page, which provides the digital text cells with coordinates and content

The COCO image records are defined as in this example:

```js
...
{
  "id": 1,
  "width": 1025,
  "height": 1025,
  "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",

  // Custom fields:
  "doc_category": "financial_reports", // high-level document category
  "collection": "ann_reports_00_04_fancy", // sub-collection name
  "doc_name": "NASDAQ_FFIN_2002.pdf", // original document filename
  "page_no": 9, // page number in original document
  "precedence": 0, // Annotation order, non-zero in case of redundant double- or triple-annotation
},
...
```

The `doc_category` field uses one of the following constants:

```
financial_reports,
scientific_articles,
laws_and_regulations,
government_tenders,
manuals,
patents
```

### Data Splits

The dataset provides three splits:

- `train`
- `val`
- `test`

## Dataset Creation

### Annotations

#### Annotation process

The labeling guidelines used to train the annotation experts are available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.
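The custom fields on the COCO image records described under Data Fields can be used to slice a split by document category or to pick out redundantly annotated pages. The following is a minimal Python sketch under stated assumptions, not part of the official DocLayNet tooling: the helper names are hypothetical, the annotation file is assumed to follow the standard COCO layout with a top-level `images` array, and the second sample record is made up purely for illustration.

```python
import json
from collections import Counter


def load_image_records(coco_path):
    """Return the list stored under the standard COCO top-level `images` key.

    `coco_path` is a placeholder for wherever a split's annotation file
    lives locally; the card does not state the file names.
    """
    with open(coco_path, encoding="utf-8") as f:
        return json.load(f)["images"]


def count_by_category(images):
    """Count image records per `doc_category` constant."""
    return Counter(img["doc_category"] for img in images)


def redundant_records(images):
    """Keep records whose `precedence` is non-zero.

    Per the card, a non-zero value marks a redundant double- or
    triple-annotation. Treating a missing field as 0 is an assumption.
    """
    return [img for img in images if img.get("precedence", 0) != 0]


# Inline sample so the sketch runs without downloading the dataset.
sample_images = [
    {  # record copied from the example in the dataset card
        "id": 1,
        "width": 1025,
        "height": 1025,
        "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",
        "doc_category": "financial_reports",
        "collection": "ann_reports_00_04_fancy",
        "doc_name": "NASDAQ_FFIN_2002.pdf",
        "page_no": 9,
        "precedence": 0,
    },
    {  # hypothetical record, invented for illustration only
        "id": 2,
        "width": 1025,
        "height": 1025,
        "file_name": "hypothetical_page.png",
        "doc_category": "patents",
        "collection": "hypothetical_collection",
        "doc_name": "hypothetical.pdf",
        "page_no": 1,
        "precedence": 1,
    },
]

print(count_by_category(sample_images))
print([img["id"] for img in redundant_records(sample_images)])
```

With a real split, `load_image_records` would supply the list in place of `sample_images`; the two filters only rely on the `doc_category` and `precedence` fields documented above.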
## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research. You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```

### Contributions

Thanks to [@dolfim-ibm](https://github.com/dolfim-ibm), [@cau-git](https://github.com/cau-git) for adding this dataset.


Provider: maas

Created: 2025-01-20
