kbang2021/doclaynet-6class
Source: Hugging Face. Updated 2026-03-23; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/kbang2021/doclaynet-6class
---
license: cdla-permissive-2.0
task_categories:
- object-detection
- document-layout-analysis
tags:
- document-ai
- layout-analysis
- object-detection
- doclaynet
- filtered
size_categories:
- 10K<n<100K
---
# DocLayNet 6-Class Filtered Dataset
## Dataset Description
This is a filtered version of the [DocLayNet dataset](https://huggingface.co/datasets/docling-project/DocLayNet) that keeps only the 6 layout element classes most relevant to document layout analysis tasks.
### Original Dataset
DocLayNet is a human-annotated document layout segmentation dataset containing 80,863 pages from diverse sources with 11 distinct layout categories.
**Citation:**
```bibtex
@article{doclaynet2022,
title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
year = {2022},
doi = {10.1145/3534678.3539043},
}
```
### Filtering Methodology
**Classes Retained (6):**
1. **Text** - Body text paragraphs
2. **List-item** - List elements (bulleted, numbered)
3. **Section-header** - Section and subsection titles
4. **Picture** - Images, figures, diagrams
5. **Table** - Tabular data structures
6. **Caption** - Image and table captions
**Classes Removed (5):**
- Footnote
- Formula
- Page-footer
- Page-header
- Title
**Rationale:** Focus on the most common and semantically important layout elements for general document understanding tasks. The 6 retained classes represent 85.1% of all annotations in the original dataset.
## Dataset Statistics
### Split Distribution
| Split | Images | Annotations | Classes |
|-------|--------|-------------|---------|
| Train | 68,673 | 800,614 | 6 |
| Validation | 6,446 | 85,057 | 6 |
| Test | 4,952 | 56,483 | 6 |
| **Total** | **80,071** | **942,154** | **6** |
### Class Distribution (Training Set)
Based on 800,614 annotations:
| Class ID | Class Name | Count | Percentage |
|----------|------------|-------|------------|
| 0 | Caption | 19,218 | 2.4% |
| 1 | List-item | 161,818 | 20.2% |
| 2 | Picture | 39,667 | 5.0% |
| 3 | Section-header | 118,590 | 14.8% |
| 4 | Table | 30,070 | 3.8% |
| 5 | Text | 431,251 | 53.9% |
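The counts and percentages in this table can be recomputed from any of the COCO files. The following is a minimal sketch, assuming the JSON has the standard COCO `categories` and `annotations` keys; the helper name `class_distribution` and the toy data are illustrative and not part of the dataset.

```python
from collections import Counter


def class_distribution(coco: dict) -> dict:
    """Return {category name: (count, percentage)} for a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(ann["category_id"] for ann in coco["annotations"])
    total = sum(counts.values())
    return {
        name: (counts[cid], round(100 * counts[cid] / total, 1) if total else 0.0)
        for cid, name in id_to_name.items()
    }


# Toy input for illustration only (not real DocLayNet data)
toy = {
    "categories": [{"id": 0, "name": "Caption"}, {"id": 5, "name": "Text"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 5},
        {"id": 2, "image_id": 1, "category_id": 5},
        {"id": 3, "image_id": 1, "category_id": 5},
        {"id": 4, "image_id": 2, "category_id": 0},
    ],
}
for name, (count, pct) in class_distribution(toy).items():
    print(f"{name}: {count} ({pct}%)")
```

Passing the loaded `coco/train.json` dict instead of `toy` should reproduce the table, provided the file matches the structure described in this card.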
### Retention from Original Dataset
- **Images retained:** 99.0%
- **Annotations retained:** 85.1%
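The image retention figure follows directly from the numbers quoted in this card. The short calculation below is only a sanity check: the original annotation total is not stated here, so the last value is a rough estimate back-derived from the rounded 85.1% figure.

```python
original_pages = 80_863        # DocLayNet page count quoted in "Original Dataset"
filtered_images = 80_071       # total from the split table
filtered_annotations = 942_154 # total from the split table

image_retention = 100 * filtered_images / original_pages
dropped_images = original_pages - filtered_images

# Rough estimate only: 85.1% is a rounded value, so this is approximate.
implied_original_annotations = filtered_annotations / 0.851

print(f"Images retained: {image_retention:.1f}% ({dropped_images} pages dropped)")
print(f"Implied original annotations: about {implied_original_annotations:,.0f}")
```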
## Dataset Structure
### Format
Annotations are provided in **COCO JSON format**:
```
DocLayNet_6class/
├── coco/
│ ├── train.json # Training annotations
│ ├── val.json # Validation annotations
│ └── test.json # Test annotations
└── README.md # This file
```
Images are **NOT included** - use the original DocLayNet image files from:
- HuggingFace: `docling-project/DocLayNet`
- Official source: https://github.com/DS4SD/DocLayNet
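Because the images ship separately, each annotation file has to be paired with a local copy of the DocLayNet images. The sketch below assumes that every entry in `images` carries the standard COCO `file_name` field and that the images sit in one directory; the directory path and both helper names are hypothetical.

```python
from pathlib import Path


def resolve_image_paths(coco: dict, image_root: str) -> dict:
    """Map each COCO image id to the path where its file is expected."""
    root = Path(image_root)
    return {img["id"]: root / img["file_name"] for img in coco["images"]}


def missing_images(paths: dict) -> list:
    """Return the image ids whose file is not present on disk."""
    return [image_id for image_id, path in paths.items() if not path.is_file()]


# Usage (the image directory below is a placeholder, adjust to your download):
# paths = resolve_image_paths(train_coco, "path/to/DocLayNet/images")
# print(f"{len(missing_images(paths))} images not found")
```

Checking for missing files before training helps catch a mismatch between the annotation files and the downloaded image set early.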
### Loading the Dataset
#### Using HuggingFace Datasets
```python
from datasets import load_dataset
# Load the filtered annotations
dataset = load_dataset("kbang2021/doclaynet-6class")
# Access splits
train_data = dataset["train"]
val_data = dataset["validation"]
test_data = dataset["test"]
```
#### Manual Loading
```python
import json
from pathlib import Path

# Load the COCO annotations for the training split
with open(Path("coco") / "train.json", encoding="utf-8") as f:
    train_coco = json.load(f)

# Categories: 6 classes with IDs 0-5
categories = train_coco["categories"]
# Images: per-image metadata
images = train_coco["images"]
# Annotations: bounding boxes with category IDs
annotations = train_coco["annotations"]
```
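COCO stores all annotations in one flat list, so most training pipelines first group them by page. A minimal sketch of that step, assuming each annotation has an `image_id` as shown in this card (the helper name `annotations_by_image` is illustrative):

```python
from collections import defaultdict


def annotations_by_image(coco: dict) -> dict:
    """Group a COCO annotation list into {image_id: [annotation, ...]}."""
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[ann["image_id"]].append(ann)
    return dict(grouped)


# Toy input for illustration only
toy = {
    "annotations": [
        {"id": 1, "image_id": 10, "category_id": 5},
        {"id": 2, "image_id": 10, "category_id": 3},
        {"id": 3, "image_id": 11, "category_id": 4},
    ]
}
per_image = annotations_by_image(toy)
print({image_id: len(anns) for image_id, anns in per_image.items()})
```

Images without any annotation do not appear as keys, so look them up with `per_image.get(image_id, [])`.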
### Annotation Format
Each annotation follows the COCO format:
```json
{
"id": 12345,
"image_id": 123,
"category_id": 5, // 0-5 (remapped from original 11 classes)
"bbox": [x_min, y_min, width, height], // In pixels
"area": 12345.67,
"iscrowd": 0
}
```
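Many detection frameworks expect corner coordinates or normalized center coordinates rather than COCO's `[x_min, y_min, width, height]`. The conversions can be sketched as follows; the function names are illustrative, and the page width and height are assumed to come from the standard COCO `width` and `height` fields of the matching `images` entry.

```python
def xywh_to_xyxy(bbox):
    """Convert [x_min, y_min, width, height] to [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def xywh_to_yolo(bbox, img_w, img_h):
    """Convert a pixel [x_min, y_min, width, height] box to normalized
    [x_center, y_center, width, height] in the 0-1 range."""
    x, y, w, h = bbox
    return [(x + w / 2) / img_w, (y + h / 2) / img_h, w / img_w, h / img_h]


print(xywh_to_xyxy([10, 20, 30, 40]))
print(xywh_to_yolo([10, 20, 30, 40], img_w=100, img_h=200))
```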
### Category Mapping
Original DocLayNet → 6-Class Filtered:
| Original ID | Original Name | Filtered ID | Filtered Name | Status |
|-------------|---------------|-------------|---------------|--------|
| 0 | Caption | 0 | Caption | ✅ Kept |
| 1 | Footnote | - | - | ❌ Removed |
| 2 | Formula | - | - | ❌ Removed |
| 3 | List-item | 1 | List-item | ✅ Kept |
| 4 | Page-footer | - | - | ❌ Removed |
| 5 | Page-header | - | - | ❌ Removed |
| 6 | Picture | 2 | Picture | ✅ Kept |
| 7 | Section-header | 3 | Section-header | ✅ Kept |
| 8 | Table | 4 | Table | ✅ Kept |
| 9 | Text | 5 | Text | ✅ Kept |
| 10 | Title | - | - | ❌ Removed |
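The mapping table can be applied to a COCO file in the original 11-class numbering as sketched below. This is not the script used to build this dataset, only an illustration under stated assumptions: the input uses the original IDs 0-10 from the table, and the `drop_empty_images` option is a guess at why about 1% of pages are absent from the filtered splits.

```python
# Original category id -> filtered category id, copied from the table above
ORIGINAL_TO_FILTERED = {0: 0, 3: 1, 6: 2, 7: 3, 8: 4, 9: 5}


def filter_coco(coco: dict, id_map: dict = None, drop_empty_images: bool = True) -> dict:
    """Keep only the mapped categories and renumber their ids.

    Returns a new dict; the input dict and its annotations are not modified.
    """
    id_map = ORIGINAL_TO_FILTERED if id_map is None else id_map
    annotations = [
        {**ann, "category_id": id_map[ann["category_id"]]}
        for ann in coco["annotations"]
        if ann["category_id"] in id_map
    ]
    categories = [
        {**cat, "id": id_map[cat["id"]]}
        for cat in coco["categories"]
        if cat["id"] in id_map
    ]
    images = coco["images"]
    if drop_empty_images:
        # Assumption: pages left with no retained annotation are removed
        used = {ann["image_id"] for ann in annotations}
        images = [img for img in images if img["id"] in used]
    return {**coco, "images": images, "annotations": annotations, "categories": categories}
```

Building new annotation dicts instead of editing them in place keeps the original file contents available for comparison.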
## Use Cases
This filtered dataset is ideal for:
- **Document layout analysis** with focus on content structure
- **Information extraction** from documents (text, tables, figures)
- **Object detection** model training for document AI
- **Multi-scale document understanding** tasks
- **Transfer learning** from general object detection to document analysis
## Limitations
1. **Images not included**: You must obtain images from the original DocLayNet dataset
2. **Class imbalance**: Text class dominates (53.9% of annotations)
3. **Domain specific**: Focused on document layout, may not generalize to other domains
4. **Annotation quality**: Inherits any annotation errors from original DocLayNet
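One common way to counter the class imbalance noted above is to weight each class by its inverse frequency. The sketch below uses the training-set counts from this card and the "balanced" heuristic `total / (n_classes * count)`; whether and how such weights help depends on the detector and loss, so treat it as a starting point rather than a recommendation from the dataset author.

```python
# Training-set annotation counts from the class distribution table
train_counts = {
    "Caption": 19_218,
    "List-item": 161_818,
    "Picture": 39_667,
    "Section-header": 118_590,
    "Table": 30_070,
    "Text": 431_251,
}


def inverse_frequency_weights(counts: dict) -> dict:
    """Return {class: total / (n_classes * count)}, so rare classes get weights above 1."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * count) for name, count in counts.items()}


weights = inverse_frequency_weights(train_counts)
for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name}: {weight:.2f}")
```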
## Ethical Considerations
- Dataset maintains the original DocLayNet license (CDLA-Permissive-2.0)
- No personal or sensitive information in annotations
- Source documents from diverse domains (financial, scientific, patents, manuals)
- Should not be used to discriminate based on document type or origin
## Citation
If you use this filtered dataset, please cite both:
1. **Original DocLayNet paper:**
```bibtex
@article{doclaynet2022,
title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis},
author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
year = {2022},
doi = {10.1145/3534678.3539043},
}
```
2. **This filtered version:**
```bibtex
@misc{doclaynet6class2024,
title = {DocLayNet 6-Class: Filtered Document Layout Analysis Dataset},
author = {Keng Boon, Ang},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/kbang2021/doclaynet-6class}},
note = {Filtered subset of DocLayNet containing 6 primary layout element classes}
}
```
## License
This filtered dataset maintains the original license:
**CDLA-Permissive-2.0** (Community Data License Agreement – Permissive – Version 2.0)
See: https://cdla.dev/permissive-2-0/
## Acknowledgments
- Original DocLayNet dataset: IBM Research
- Built using the layout-for-tools evaluation framework
## Contact
For questions or issues with this filtered dataset, please open an issue on the repository.
For questions about the original DocLayNet dataset, see: https://github.com/DS4SD/DocLayNet
Provider: kbang2021