
DocLayNet-v1.1

ModelScope community, updated 2025-12-05, indexed 2025-01-25
Download link:
https://modelscope.cn/datasets/ds4sd/DocLayNet-v1.1
Resource description:
# Dataset Card for DocLayNet v1.1

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Dataset Structure](#dataset-structure)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Annotations](#annotations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://developer.ibm.com/exchanges/data/all/doclaynet/
- **Repository:** https://github.com/DS4SD/DocLayNet
- **Paper:** https://doi.org/10.1145/3534678.3539043

### Dataset Summary

DocLayNet provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80863 unique pages from 6 document categories. It offers several features that set it apart from related work such as PubLayNet or DocBank:

1. *Human Annotation*: DocLayNet is hand-annotated by well-trained experts, providing a gold standard in layout segmentation through human recognition and interpretation of each page layout.
2. *Large layout variability*: DocLayNet includes diverse and complex layouts from a large variety of public sources in Finance, Science, Patents, Tenders, Law texts and Manuals.
3. *Detailed label set*: DocLayNet defines 11 class labels to distinguish layout features in high detail.
4. *Redundant annotations*: A fraction of the pages in DocLayNet are double- or triple-annotated, which makes it possible to estimate annotation uncertainty and an upper bound on the prediction accuracy achievable with ML models.
5. *Pre-defined train, test, and validation sets*: DocLayNet provides a fixed set for each, to ensure proportional representation of the class labels and to avoid leakage of unique layout styles across the sets.

## Dataset Structure

This dataset is structured differently from the other repository, [ds4sd/DocLayNet](https://huggingface.co/datasets/ds4sd/DocLayNet): this one includes the content (PDF cells) of the detections and abandons the COCO format.

* `image`: page PIL image.
* `bboxes`: a list of layout bounding boxes.
* `category_id`: a list of class ids corresponding to the bounding boxes.
* `segmentation`: a list of layout segmentation polygons.
* `pdf_cells`: a list of lists corresponding to `bboxes`. Each inner list contains the PDF cells (content) inside the corresponding bounding box.
* `metadata`: page and document metadata.

Bounding box classes / categories:

```
1: Caption
2: Footnote
3: Formula
4: List-item
5: Page-footer
6: Page-header
7: Picture
8: Section-header
9: Table
10: Text
11: Title
```

The `["metadata"]["doc_category"]` field uses one of the following constants:

```
* financial_reports,
* scientific_articles,
* laws_and_regulations,
* government_tenders,
* manuals,
* patents
```

### Data Splits

The dataset provides three splits:

- `train`
- `val`
- `test`

## Dataset Creation

### Annotations

#### Annotation process

The labeling guide used to train the annotation experts is available at [DocLayNet_Labeling_Guide_Public.pdf](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

#### Who are the annotators?

Annotations are crowdsourced.

## Additional Information

### Dataset Curators

The dataset is curated by the [Deep Search team](https://ds4sd.github.io/) at IBM Research. You can contact us at [deepsearch-core@zurich.ibm.com](mailto:deepsearch-core@zurich.ibm.com).

Curators:

- Christoph Auer, [@cau-git](https://github.com/cau-git)
- Michele Dolfi, [@dolfim-ibm](https://github.com/dolfim-ibm)
- Ahmed Nassar, [@nassarofficial](https://github.com/nassarofficial)
- Peter Staar, [@PeterStaar-IBM](https://github.com/PeterStaar-IBM)

### Licensing Information

License: [CDLA-Permissive-1.0](https://cdla.io/permissive-1-0/)

### Citation Information

```bib
@article{doclaynet2022,
  title = {DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation},
  doi = {10.1145/3534678.3539043},
  url = {https://doi.org/10.1145/3534678.3539043},
  author = {Pfitzmann, Birgit and Auer, Christoph and Dolfi, Michele and Nassar, Ahmed S and Staar, Peter W J},
  year = {2022},
  isbn = {9781450393850},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  booktitle = {Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  pages = {3743–3751},
  numpages = {9},
  location = {Washington DC, USA},
  series = {KDD '22}
}
```
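
### Example Usage

The parallel-list layout described under Dataset Structure (`bboxes`, `category_id`, and `pdf_cells` aligned by index) may be easier to see in code. The following is a minimal sketch, not part of the official tooling: the helper names `summarize_page` and `label_histogram` are hypothetical, and the example record is made up for illustration. The card does not specify the bounding-box coordinate convention or the internal layout of a PDF cell, so both are treated as opaque values here.

```python
from collections import Counter

# Class ids and names exactly as listed in the dataset card.
CATEGORY_NAMES = {
    1: "Caption", 2: "Footnote", 3: "Formula", 4: "List-item",
    5: "Page-footer", 6: "Page-header", 7: "Picture",
    8: "Section-header", 9: "Table", 10: "Text", 11: "Title",
}


def summarize_page(record):
    """Pair each box with its class name and PDF-cell count (hypothetical helper)."""
    boxes = record["bboxes"]
    labels = record["category_id"]
    cells = record["pdf_cells"]
    # The three lists are described as parallel, so their lengths must match.
    if not (len(boxes) == len(labels) == len(cells)):
        raise ValueError("bboxes, category_id and pdf_cells must have equal length")
    return [
        {"label": CATEGORY_NAMES[cid], "bbox": box, "n_cells": len(cell_list)}
        for box, cid, cell_list in zip(boxes, labels, cells)
    ]


def label_histogram(record):
    """Count how often each class name occurs on one page (hypothetical helper)."""
    return Counter(CATEGORY_NAMES[cid] for cid in record["category_id"])


# Made-up record: the box numbers and cell values are placeholders, not real data.
example = {
    "bboxes": [[10.0, 20.0, 300.0, 40.0], [10.0, 80.0, 300.0, 200.0]],
    "category_id": [11, 10],
    "pdf_cells": [["cell-a"], ["cell-b", "cell-c"]],
    "metadata": {"doc_category": "manuals"},
}

layout = summarize_page(example)
for item in layout:
    print(item["label"], item["bbox"], item["n_cells"])
```

With a real record from one of the `train`, `val`, or `test` splits, the same helpers should apply as long as the field names match those listed in the card.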

Provider:
maas
Created:
2025-01-20