DocLayNet
Source: ModelScope community. Updated 2026-01-06; indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/swift/DocLayNet
# Dataset Card for DocLayNet
## Table of Contents
- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
## Dataset Description
- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043
- **Leaderboard:**
- **Point of Contact:**
### Dataset Summary
DocLayNet provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80,863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:
1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold standard in layout segmentation through human recognition and interpretation of each page layout.
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals.
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models.
5. *Pre-defined train, test, and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and avoid leakage of unique layout styles across the sets.
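
One plausible way to quantify agreement between redundant annotations of the same page is the intersection-over-union (IoU) of corresponding bounding boxes. The sketch below is an illustration only and is not necessarily the measure the dataset authors use; it assumes boxes in the COCO convention `[x, y, width, height]`, and the function name `coco_bbox_iou` and the sample boxes are hypothetical.

```python
from typing import Sequence


def coco_bbox_iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection-over-union of two COCO-style boxes [x, y, width, height]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap region, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes drawn by two annotators around the same paragraph.
annotator_1 = [100.0, 200.0, 400.0, 100.0]
annotator_2 = [110.0, 205.0, 400.0, 100.0]
print(f"IoU = {coco_bbox_iou(annotator_1, annotator_2):.3f}")
```

An IoU close to 1 suggests the annotators agree on the region; averaging it over matched boxes would give a rough, assumption-laden picture of annotation uncertainty.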
### Supported Tasks and Leaderboards
We are hosting a competition in ICDAR 2023 based on the DocLayNet dataset. For more information see https://ds4sd.github.io/icdar23-doclaynet/.
## Dataset Structure
### Data Fields
DocLayNet provides four types of data assets:
1. PNG images of all pages, resized to square `1025 x 1025px`
2. Bounding-box annotations in COCO format for each PNG image
3. Extra: Single-page PDF files matching each PNG image
4. Extra: JSON file matching each PDF page, which provides the digital text cells with coordinates and content
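
Because the annotations use COCO format, each bounding box is stored as `[x, y, width, height]` in pixel units of the page image. A minimal sketch, assuming the 1025-pixel square pages described above, converts such a box to corner coordinates and to page-relative values; the helper names `coco_to_corners` and `normalize_corners` are hypothetical and not part of any DocLayNet tooling.

```python
from typing import Sequence, Tuple

PAGE_SIZE = 1025  # side length in pixels of the square PNG pages described above


def coco_to_corners(bbox: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert a COCO box [x, y, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, w, h = bbox
    return (x, y, x + w, y + h)


def normalize_corners(
    corners: Sequence[float], page_size: float = PAGE_SIZE
) -> Tuple[float, ...]:
    """Scale pixel coordinates to the range [0, 1] relative to the page side."""
    return tuple(value / page_size for value in corners)


# Hypothetical box, used only to illustrate the conversion.
corners = coco_to_corners([102.5, 205.0, 410.0, 51.25])
print(corners)
print(normalize_corners(corners))
```

Page-relative coordinates can be convenient when boxes need to be mapped onto a rendering of the page at a different resolution.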
COCO image records are defined as in this example:
```js
...
{
"id": 1,
"width": 1025,
"height": 1025,
"file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",
// Custom fields:
"doc_category": "financial_reports", // high-level document category
"collection": "ann_reports_00_04_fancy", // sub-collection name
"doc_name": "NASDAQ_FFIN_2002.pdf", // original document filename
"page_no": 9, // page number in original document
"precedence": 0, // Annotation order, non-zero in case of redundant double- or triple-annotation
},
...
```
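
The custom fields can be used to select and group pages. The following sketch keeps only records whose `precedence` is 0 and counts pages per `doc_category`. It assumes that `precedence` 0 marks the first annotation of a page, as the comment above suggests, and it runs on a made-up in-memory COCO document: only the first record mirrors the example above, and the other file names are hypothetical.

```python
import json
from collections import Counter

# Made-up miniature COCO file; only the first record mirrors the example above.
COCO_JSON = """
{
  "images": [
    {"id": 1,
     "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",
     "doc_category": "financial_reports", "doc_name": "NASDAQ_FFIN_2002.pdf",
     "page_no": 9, "precedence": 0},
    {"id": 2, "file_name": "hypothetical_page_a.png",
     "doc_category": "patents", "doc_name": "hypothetical_patent.pdf",
     "page_no": 3, "precedence": 0},
    {"id": 3, "file_name": "hypothetical_page_b.png",
     "doc_category": "patents", "doc_name": "hypothetical_patent.pdf",
     "page_no": 3, "precedence": 1}
  ]
}
"""


def primary_images(coco: dict) -> list:
    """Return image records with precedence 0 (assumed to be the first annotation)."""
    return [img for img in coco["images"] if img.get("precedence", 0) == 0]


def pages_per_category(images: list) -> Counter:
    """Count image records per high-level document category."""
    return Counter(img["doc_category"] for img in images)


coco = json.loads(COCO_JSON)
print(pages_per_category(primary_images(coco)))
```

Filtering on `precedence` before counting avoids counting a double- or triple-annotated page more than once.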
The `doc_category` field uses one of the following constants:
```
financial_reports,
scientific_articles,
laws_and_regulations,
government_tenders,
manuals,
patents
```
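
When reading records, it can help to reject unexpected category values early. This is a minimal sketch that checks a record against the six constants listed above; the names `DOC_CATEGORIES` and `validate_doc_category` are hypothetical helpers, not part of the dataset.

```python
# The six doc_category constants listed above.
DOC_CATEGORIES = frozenset(
    {
        "financial_reports",
        "scientific_articles",
        "laws_and_regulations",
        "government_tenders",
        "manuals",
        "patents",
    }
)


def validate_doc_category(record: dict) -> str:
    """Return the record's doc_category, or raise ValueError if it is not a known constant."""
    category = record.get("doc_category")
    if category not in DOC_CATEGORIES:
        raise ValueError(f"unexpected doc_category: {category!r}")
    return category


print(validate_doc_category({"id": 1, "doc_category": "financial_reports"}))
```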
### Data Splits
The dataset provides three splits:
- `train`
- `val`
- `test`
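
Since the splits are fixed, a quick sanity check after loading is to confirm that no page image appears in more than one split. The sketch below assumes each split has already been loaded into a list of COCO image records; the function name `shared_pages` and the sample records are hypothetical, and the check covers only exact `file_name` overlap, not the layout-style leakage the dataset authors guard against.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def shared_pages(splits: Dict[str, List[dict]]) -> Dict[Tuple[str, str], Set[str]]:
    """Return, for each pair of splits, the file names that occur in both (if any)."""
    names = {
        split: {img["file_name"] for img in images} for split, images in splits.items()
    }
    overlaps = {}
    for first, second in combinations(sorted(names), 2):
        common = names[first] & names[second]
        if common:
            overlaps[(first, second)] = common
    return overlaps


# Hypothetical records standing in for the loaded splits.
example_splits = {
    "train": [{"file_name": "a.png"}, {"file_name": "b.png"}],
    "val": [{"file_name": "c.png"}],
    "test": [{"file_name": "d.png"}],
}
print(shared_pages(example_splits))
```

An empty result means the splits share no page images under this simple check.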
## Dataset Creation
### Annotations
#### Annotation process
The labeling guidelines used to train the annotation experts are available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).
#### Who are the annotators?
Annotations are crowdsourced.
## Additional Information
### Dataset Curators
The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
Curators:
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)
### Licensing Information
License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)
### Citation Information
```bib
@inproceedings{doclaynet2022,
title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
doi = {10.1145/3534678.3539043},
url = {https://doi.org/10.1145/3534678.3539043},
author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
year = {2022},
isbn = {9781450393850},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {3743–3751},
numpages = {9},
location = {Washington DC, USA},
series = {KDD '22}
}
```
### Contributions
Thanks to [@dolfim-ibm](https://github.com/dolfim-ibm) and [@cau-git](https://github.com/cau-git) for adding this dataset.
Provider: maas
Created: 2025-01-20



