docling-DocLayNet-v1.1

Name: docling-DocLayNet-v1.1
Creator: maas
Published: 2025-12-05 16:22:42
License: No description available

ModelScope community: updated 2025-12-05, indexed 2025-02-15

Download link:

https://modelscope.cn/datasets/ds4sd/docling-DocLayNet-v1.1

Resource description:

# Dataset Card for DocLayNet v1.2

## Dataset Description

- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043

### Dataset Summary

This dataset is an extension of the [original DocLayNet dataset](https://github.com/DS4SD/DocLayNet) which embeds the PDF files of the document images inside a binary column.

DocLayNet provides page-by-page layout segmentation ground-truth using bounding-boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold-standard in layout segmentation through human recognition and interpretation of each page layout
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models
5. *Pre-defined train, test, and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and avoid leakage of unique layout styles across the sets.

## Dataset Structure

This dataset is structured differently from the other repository [ds4sd/DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet), as this one includes the content (PDF cells) of the detections and drops the COCO format.

* `image`: page PIL image.
* `bboxes`: a list of layout bounding boxes.
* `category_id`: a list of class ids corresponding to the bounding boxes.
* `segmentation`: a list of layout segmentation polygons.
* `area`: area of the bboxes.
* `pdf_cells`: a list of lists corresponding to `bboxes`. Each list contains the PDF cells (content) inside the bbox.
* `metadata`: page and document metadata details.
* `pdf`: binary blob with the original PDF image.

Bounding box classes / categories:

```
1: Caption
2: Footnote
3: Formula
4: List-item
5: Page-footer
6: Page-header
7: Picture
8: Section-header
9: Table
10: Text
11: Title
```

The `["metadata"]["doc_category"]` field uses one of the following constants:

```
* financial_reports,
* scientific_articles,
* laws_and_regulations,
* government_tenders,
* manuals,
* patents
```

### Data Splits

The dataset provides three splits

- `train`
- `val`
- `test`

## Dataset Creation

### Annotations

#### Annotation process

The labeling guideline used to train the annotation experts is available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.

## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research. You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```
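
The fields listed under Dataset Structure are parallel lists, so one way to read a page is to pair each bounding box with its class id and its PDF cells. The following is only a minimal sketch under stated assumptions: `sample` is a hand-made stand-in for one record rather than real data, `describe_page` is a hypothetical helper, each PDF cell is assumed to be a mapping with a `"text"` key, and the bounding-box coordinate convention (not stated in the card) is left untouched.

```python
from collections import Counter

# Class ids exactly as listed in the dataset card.
ID2LABEL = {
    1: "Caption", 2: "Footnote", 3: "Formula", 4: "List-item",
    5: "Page-footer", 6: "Page-header", 7: "Picture",
    8: "Section-header", 9: "Table", 10: "Text", 11: "Title",
}


def describe_page(record):
    """Pair each bounding box with its class name and the text of its PDF cells.

    Hypothetical helper: it assumes `bboxes`, `category_id` and `pdf_cells`
    are index-aligned, and that every cell is a mapping with a "text" key.
    """
    regions = []
    for bbox, cat_id, cells in zip(
        record["bboxes"], record["category_id"], record["pdf_cells"]
    ):
        regions.append({
            "label": ID2LABEL[cat_id],
            "bbox": bbox,  # passed through as-is; coordinate order is not assumed
            "text": " ".join(cell["text"] for cell in cells),
        })
    return regions


# Hand-made stand-in for a single record (not real data from the dataset).
sample = {
    "bboxes": [[10.0, 10.0, 200.0, 30.0], [10.0, 50.0, 200.0, 120.0]],
    "category_id": [11, 10],
    "pdf_cells": [
        [{"text": "Annual"}, {"text": "Report"}],
        [{"text": "Revenue grew."}],
    ],
    "metadata": {"doc_category": "financial_reports"},
}

regions = describe_page(sample)
label_counts = Counter(region["label"] for region in regions)

for region in regions:
    print(region["label"], "->", region["text"])
```

If a real record follows the same layout, the same helper could be applied to it directly; should the cell structure differ, only the line that joins the cell text would need to change.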


Provider:

maas

Created:

2025-02-07
