DocLayNet-v1.2
Source: ModelScope community (updated 2026-01-06, indexed 2025-02-15)
Download link: https://modelscope.cn/datasets/ds4sd/DocLayNet-v1.2
# Dataset Card for DocLayNet v1.2
## Dataset Description
- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043
### Dataset Summary
This dataset is an extension of the [original DocLayNet dataset](https://github.com/DS4SD/DocLayNet) that embeds the PDF files of the document images in a binary column.
DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:
1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold-standard in layout segmentation through human recognition and interpretation of each page layout
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models
5. *Pre-defined train, test, and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and to avoid leakage of unique layout styles across the sets.
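To illustrate how redundant annotations could be used to gauge annotation uncertainty, here is a minimal sketch that compares two annotators' boxes for the same page by intersection over union (IoU). It is not necessarily the agreement metric the authors used. The `(x0, y0, x1, y1)` corner convention and the helper names `iou` and `mean_best_iou` are assumptions made for this example only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1).

    The corner convention is an assumption for this sketch.
    """
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_best_iou(boxes_a, boxes_b):
    """Average, over annotator A's boxes, of the best IoU with any box of B."""
    if not boxes_a:
        return 0.0
    best = [max((iou(a, b) for b in boxes_b), default=0.0) for a in boxes_a]
    return sum(best) / len(best)


# Hypothetical boxes drawn by two annotators on the same page.
annotator_1 = [(0.0, 0.0, 2.0, 2.0)]
annotator_2 = [(1.0, 1.0, 3.0, 3.0)]
agreement = mean_best_iou(annotator_1, annotator_2)
```

A value close to 1 means the two annotators drew nearly identical boxes. Lower values indicate disagreement, which in turn limits the accuracy a model can be expected to reach on such pages.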
## Dataset Structure
This dataset is structured differently from the other repository [ds4sd/DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet), as this one includes the content (PDF cells) of the detections, and abandons the COCO format.
* `image`: page PIL image.
* `bboxes`: a list of layout bounding boxes.
* `category_id`: a list of class ids corresponding to the bounding boxes.
* `segmentation`: a list of layout segmentation polygons.
* `area`: area of the bounding boxes.
* `pdf_cells`: a list of lists corresponding to `bboxes`. Each inner list contains the PDF cells (content) inside the corresponding bounding box.
* `metadata`: page and document metadata.
* `pdf`: binary blob with the original PDF file of the page.
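As a rough sketch of how these fields fit together, the snippet below walks over the parallel per-box lists of one record and writes the embedded PDF to disk. It assumes that a record behaves like a plain dict with the field names listed above, that `pdf` holds raw bytes, and that the three per-box lists have equal length. The helper names `iter_layout_items` and `write_pdf` and the sample record are hypothetical, and the internal structure of a PDF cell is treated as opaque.

```python
from pathlib import Path


def iter_layout_items(record):
    """Yield (bbox, category_id, cells) triples for one page record.

    Assumes `bboxes`, `category_id` and `pdf_cells` are parallel lists.
    """
    bboxes = record["bboxes"]
    category_ids = record["category_id"]
    pdf_cells = record["pdf_cells"]
    if not (len(bboxes) == len(category_ids) == len(pdf_cells)):
        raise ValueError("per-box lists have different lengths")
    yield from zip(bboxes, category_ids, pdf_cells)


def write_pdf(record, path):
    """Write the `pdf` blob to `path` and return the number of bytes written."""
    return Path(path).write_bytes(bytes(record["pdf"]))


# Hypothetical record for illustration; coordinate values are placeholders and
# do not imply any particular bounding-box convention.
sample = {
    "bboxes": [[10.0, 20.0, 110.0, 40.0], [10.0, 50.0, 200.0, 300.0]],
    "category_id": [11, 10],
    "pdf_cells": [["cell-a"], ["cell-b", "cell-c"]],
    "pdf": b"%PDF-1.4 placeholder bytes",
}
items = list(iter_layout_items(sample))
```

Checking the list lengths up front makes a malformed record fail loudly instead of being silently truncated by `zip`.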
Bounding boxes classes / categories:
```
1: Caption
2: Footnote
3: Formula
4: List-item
5: Page-footer
6: Page-header
7: Picture
8: Section-header
9: Table
10: Text
11: Title
```
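The table above can be turned into a lookup for decoding a page's `category_id` list. This is a minimal sketch: the ids and names are copied from the card, while `ID_TO_LABEL`, `label_names`, and `label_histogram` are names chosen for this example.

```python
from collections import Counter

# Class ids and names as listed in the dataset card.
ID_TO_LABEL = {
    1: "Caption",
    2: "Footnote",
    3: "Formula",
    4: "List-item",
    5: "Page-footer",
    6: "Page-header",
    7: "Picture",
    8: "Section-header",
    9: "Table",
    10: "Text",
    11: "Title",
}


def label_names(category_ids):
    """Translate a list of class ids into class names."""
    return [ID_TO_LABEL[class_id] for class_id in category_ids]


def label_histogram(category_ids):
    """Count how often each class name occurs on a page."""
    return Counter(label_names(category_ids))


# Hypothetical `category_id` list for one page.
page_ids = [11, 10, 10, 7, 1]
histogram = label_histogram(page_ids)
```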
The `["metadata"]["doc_category"]` field uses one of the following constants:
```
financial_reports
scientific_articles
laws_and_regulations
government_tenders
manuals
patents
```
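One way to select pages of a single document category is sketched below. It assumes each record exposes `["metadata"]["doc_category"]` as described above. `DOC_CATEGORIES`, `filter_by_category`, and the sample pages are hypothetical names and data introduced for illustration.

```python
# The six constants listed in the dataset card.
DOC_CATEGORIES = frozenset(
    {
        "financial_reports",
        "scientific_articles",
        "laws_and_regulations",
        "government_tenders",
        "manuals",
        "patents",
    }
)


def filter_by_category(records, category):
    """Return the records whose `doc_category` equals `category`."""
    if category not in DOC_CATEGORIES:
        raise ValueError(f"unknown doc_category: {category!r}")
    return [r for r in records if r["metadata"]["doc_category"] == category]


# Hypothetical records carrying only the field this sketch needs.
pages = [
    {"metadata": {"doc_category": "patents"}},
    {"metadata": {"doc_category": "manuals"}},
    {"metadata": {"doc_category": "patents"}},
]
patent_pages = filter_by_category(pages, "patents")
```

Validating the requested category against the known constants catches typos such as `"patent"` early rather than returning an empty result.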
### Data Splits
The dataset provides three splits:
- `train`
- `val`
- `test`
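A split could be loaded along the following lines with the Hugging Face `datasets` library. This is a sketch under assumptions: the repository id `ds4sd/DocLayNet-v1.2` mirrors the ModelScope path and may differ on the Hub, the split names are taken from the list above, and `load_split` is a hypothetical helper.

```python
VALID_SPLITS = ("train", "val", "test")


def load_split(split, repo_id="ds4sd/DocLayNet-v1.2", streaming=True):
    """Load one split; `repo_id` is an assumed Hub identifier.

    With streaming enabled, records are fetched lazily rather than
    downloading the full dataset (including the embedded PDFs) first.
    """
    if split not in VALID_SPLITS:
        raise ValueError(f"split must be one of {VALID_SPLITS}, got {split!r}")
    # Imported lazily so that validating a split name does not need the library.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=streaming)


# Example (requires network access and the `datasets` package):
# first_page = next(iter(load_split("val")))
```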
## Dataset Creation
### Annotations
#### Annotation process
The labeling guide used to train the annotation experts is available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).
#### Who are the annotators?
Annotations are crowdsourced.
## Additional Information
### Dataset Curators
The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research.
You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).
Curators:
- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)
### Licensing Information
License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)
### Citation Information
```bib
@article{doclaynet2022,
title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
doi = {10.1145/3534678.3539043},
url = {https://doi.org/10.1145/3534678.3539043},
author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
year = {2022},
isbn = {9781450393850},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {3743–3751},
numpages = {9},
location = {Washington DC, USA},
series = {KDD '22}
}
```
Provider: maas
Created: 2025-02-11
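Background overview
DocLayNet-v1.2 is a document layout segmentation dataset of 80863 unique pages covering 6 document categories and 11 class labels. Its features include human annotation, diverse layouts, a detailed label set, and pre-defined dataset splits, making it suitable for document layout analysis and for training machine learning models.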