DocHieNet: A Document Hierarchy Parsing Dataset

Name: DocHieNet, a document hierarchy parsing dataset
Creator: maas
Published: 2026-05-14 20:59:55
License: not specified

ModelScope community · Updated 2026-05-14 · Indexed 2024-12-28

Download link:

https://modelscope.cn/datasets/iic/DocHieNet

Resource description:

# The DocHieNet for Document Hierarchy Parsing

This repository contains the dataset used for document hierarchy parsing. It includes the document images together with annotations of layout entities, their text, their reading order, and the hierarchical structure of each document.

## Dataset Structure

The dataset is organized into two main parts: images and labels. The directory named `hres_images` contains higher-resolution document images.

### Annotation Format

Each label file contains the annotations of one document, organized as follows:

```
{
  "pages": {
    "page1": {
      "width": width of page 1,
      "height": height of page 1
    },
    ...
  },
  "contents": [
    {
      "box": [x1, y1, x2, y2],
      "text": text of the layout entity,
      "page": page number,
      "label": type of the layout entity,
      "linking": [
        [ parent_id, self_id ]
      ],
      "id": self_id,
      "order": reading order
    },
    ...
  ]
}
```

Each item in the `contents` list holds the annotations of one layout entity, such as a paragraph, title, table, or header.

### Data Split

The training/testing split is provided in the `train_test_split.json` file. The English/Chinese split is provided in the `en_zh_split.json` file.

## How to Use

To set up the dataset, run the following commands:

```sh
cat dochienet_dataset.zip.part-* > dochienet_dataset.zip
unzip dochienet_dataset.zip
```

The first command reassembles the split archive parts into a single zip file; the second unpacks the dataset files for use.
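The annotation format above stores the hierarchy as `[parent_id, self_id]` pairs in each entity's `linking` field. As a minimal sketch of how one label file might be turned into an indented outline, the Python below groups entities under their parents and sorts siblings by `order`. Several details are assumptions not stated in the description: that an entity with no `linking` pair (or with a parent id absent from `contents`) is top-level, that `page` is an integer, and the label names and sample values, which are invented for illustration. The helper names `load_label`, `build_hierarchy`, and `outline` are hypothetical.

```python
import json
from collections import defaultdict


def load_label(path):
    """Read one label file (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def build_hierarchy(doc):
    """Return (entities by id, child ids by parent id, top-level ids)."""
    entities = {e["id"]: e for e in doc["contents"]}
    children = defaultdict(list)
    has_parent = set()
    for e in doc["contents"]:
        for parent_id, child_id in e.get("linking", []):
            # Assumption: a parent id missing from `contents` means "top-level".
            if parent_id not in entities or parent_id == child_id:
                continue
            if child_id not in children[parent_id]:
                children[parent_id].append(child_id)
            has_parent.add(child_id)

    def by_order(entity_id):
        return entities[entity_id]["order"]

    for kids in children.values():
        kids.sort(key=by_order)
    roots = sorted((i for i in entities if i not in has_parent), key=by_order)
    return entities, children, roots


def outline(doc):
    """Render the hierarchy as indented '[label] text' lines."""
    entities, children, roots = build_hierarchy(doc)
    lines, seen = [], set()

    def walk(entity_id, depth):
        if entity_id in seen:  # guard against malformed, cyclic links
            return
        seen.add(entity_id)
        e = entities[entity_id]
        lines.append("  " * depth + f'[{e["label"]}] {e["text"]}')
        for child_id in children.get(entity_id, []):
            walk(child_id, depth + 1)

    for root_id in roots:
        walk(root_id, 0)
    return lines


# Invented sample in the documented shape; values are illustrative only.
sample = {
    "pages": {"page1": {"width": 1000, "height": 1400}},
    "contents": [
        {"box": [50, 50, 900, 100], "text": "1 Introduction", "page": 1,
         "label": "title", "linking": [], "id": 0, "order": 0},
        {"box": [50, 300, 900, 450], "text": "Second paragraph.", "page": 1,
         "label": "paragraph", "linking": [[0, 2]], "id": 2, "order": 2},
        {"box": [50, 120, 900, 280], "text": "First paragraph.", "page": 1,
         "label": "paragraph", "linking": [[0, 1]], "id": 1, "order": 1},
        {"box": [50, 500, 900, 550], "text": "2 Method", "page": 1,
         "label": "title", "linking": [], "id": 3, "order": 3},
    ],
}

if __name__ == "__main__":
    print("\n".join(outline(sample)))
```

For a real document, `outline(load_label(path))` would be called with the path of one label file from the unpacked archive; the actual directory layout and file names should be checked against the extracted dataset.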

Provider:

maas

Created:

2024-12-20

Collection summary

Dataset introduction