DocLayNet

ModelScope community. Updated 2026-01-06. Indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/swift/DocLayNet
Resource description:
# Dataset Card for DocLayNet

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

DocLayNet provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It provides several unique features compared to related work such as PubLayNet or DocBank:

1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold standard in layout segmentation through human recognition and interpretation of each page layout.
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals.
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models.
5. *Pre-defined train, test and validation sets*: DocLayNet provides fixed sets for each to ensure proportional representation of the class labels and to avoid leakage of unique layout styles across the sets.

### Supported Tasks and Leaderboards

We are hosting a competition at ICDAR 2023 based on the DocLayNet dataset. For more information see https://ds4sd.github.io/icdar23-doclaynet/.

## Dataset Structure

### Data Fields

DocLayNet provides four types of data assets:

1. PNG images of all pages, resized to square `1025 x 1025px`
2. Bounding-box annotations in COCO format for each PNG image
3. Extra: Single-page PDF files matching each PNG image
4. Extra: JSON file matching each PDF page, which provides the digital text cells with coordinates and content

A COCO image record is defined as in this example:

```js
...
{
  "id": 1,
  "width": 1025,
  "height": 1025,
  "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",

  // Custom fields:
  "doc_category": "financial_reports",     // high-level document category
  "collection": "ann_reports_00_04_fancy", // sub-collection name
  "doc_name": "NASDAQ_FFIN_2002.pdf",      // original document filename
  "page_no": 9,                            // page number in original document
  "precedence": 0,                         // Annotation order, non-zero in case of redundant double- or triple-annotation
},
...
```

The `doc_category` field uses one of the following constants:

```
financial_reports,
scientific_articles,
laws_and_regulations,
government_tenders,
manuals,
patents
```

### Data Splits

The dataset provides three splits:

- `train`
- `val`
- `test`

## Dataset Creation

### Annotations

#### Annotation process

The labeling guide used to train the annotation experts is available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.
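As a rough illustration of how the custom fields of the COCO image records described under Data Fields might be used, the sketch below filters records by `doc_category` and `precedence` and counts pages per category. The helper names `primary_images` and `pages_per_category` are hypothetical, the sample records other than the first are made up, and treating `precedence == 0` as "the primary annotation of a page" is an assumption based on the field comment, not a rule stated by the dataset authors.

```python
from collections import Counter

# Allowed values of the custom `doc_category` field, as listed in the card.
DOC_CATEGORIES = {
    "financial_reports",
    "scientific_articles",
    "laws_and_regulations",
    "government_tenders",
    "manuals",
    "patents",
}


def primary_images(images, category=None):
    """Return image records with precedence == 0, optionally for one category.

    Assumption: precedence 0 marks the first annotation of a page, and
    non-zero values mark redundant double- or triple-annotations.
    """
    if category is not None and category not in DOC_CATEGORIES:
        raise ValueError(f"unknown doc_category: {category!r}")
    return [
        rec
        for rec in images
        if rec["precedence"] == 0
        and (category is None or rec["doc_category"] == category)
    ]


def pages_per_category(images):
    """Count image records per doc_category."""
    return Counter(rec["doc_category"] for rec in images)


# The first record mirrors the example in the card; the other two are
# invented for illustration only (hypothetical file and document names).
sample_images = [
    {
        "id": 1, "width": 1025, "height": 1025,
        "file_name": "132a855ee8b23533d8ae69af0049c038171a06ddfcac892c3c6d7e6b4091c642.png",
        "doc_category": "financial_reports",
        "collection": "ann_reports_00_04_fancy",
        "doc_name": "NASDAQ_FFIN_2002.pdf",
        "page_no": 9, "precedence": 0,
    },
    {
        "id": 2, "width": 1025, "height": 1025,
        "file_name": "hypothetical_page_0002.png",
        "doc_category": "financial_reports",
        "collection": "ann_reports_00_04_fancy",
        "doc_name": "NASDAQ_FFIN_2002.pdf",
        "page_no": 9, "precedence": 1,
    },
    {
        "id": 3, "width": 1025, "height": 1025,
        "file_name": "hypothetical_page_0003.png",
        "doc_category": "patents",
        "collection": "hypothetical_collection",
        "doc_name": "hypothetical_patent.pdf",
        "page_no": 1, "precedence": 0,
    },
]

if __name__ == "__main__":
    print([rec["id"] for rec in primary_images(sample_images)])
    print(pages_per_category(sample_images))
```

In practice the `images` list would come from the `"images"` array of a loaded COCO annotation file; the location and name of that file are not given in this card, so the sketch works on in-memory records only.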
## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research. You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```

### Contributions

Thanks to [@dolfim-ibm](https://github.com/dolfim-ibm), [@cau-git](https://github.com/cau-git) for adding this dataset.

Provider:
maas
Created:
2025-01-20