PubMed, Chn
Source: arXiv | Updated: 2020-03-30 | Indexed: 2024-06-21
Download link:
https://github.com/kailigo/cddod
Resource overview:
This study establishes a benchmark suite for cross-domain document object detection, comprising PDF document datasets from different domains, including PubMed and Chn. The PubMed dataset is extracted from medical journal articles and contains annotations for more than 3.6 million object instances in five categories: text, title, list, table, and figure. The Chn dataset is generated by crawling Chinese Wikipedia pages and converting them into PDF files with bounding box annotations; its layout and style parameters are randomly sampled according to statistics from real-world documents. In addition to page images and bounding box annotations, the datasets include the original PDF files and PDF rendering layers for model training and evaluation. The intended applications are intelligent document editing and understanding, and the benchmark targets the large variation of document objects in layout, size, aspect ratio, and texture.
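As a rough illustration of the variation in size and aspect ratio mentioned above, the following minimal sketch computes the mean width-to-height ratio per category from bounding boxes. The `DocObject` class, its field names, and the sample boxes are hypothetical and made up for this example; they are not the dataset's actual annotation schema, which should be checked in the repository.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The five categories named in the dataset description.
CATEGORIES = ("text", "title", "list", "table", "figure")


@dataclass(frozen=True)
class DocObject:
    """One annotated object. Field names are hypothetical, not the dataset's schema."""
    category: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min

    @property
    def aspect_ratio(self) -> float:
        return self.width / self.height


def mean_aspect_ratio_by_category(objects):
    """Return {category: mean width/height ratio} for the given objects."""
    ratios = defaultdict(list)
    for obj in objects:
        if obj.category not in CATEGORIES:
            raise ValueError(f"unknown category: {obj.category!r}")
        if obj.height <= 0:
            continue  # skip degenerate boxes with no height
        ratios[obj.category].append(obj.aspect_ratio)
    return {cat: mean(vals) for cat, vals in ratios.items()}


# Made-up boxes for illustration only (not taken from the dataset).
page = [
    DocObject("title", 50, 40, 550, 80),
    DocObject("text", 50, 100, 300, 600),
    DocObject("table", 320, 100, 560, 300),
]
stats = mean_aspect_ratio_by_category(page)
```

Comparing such per-category statistics between two domains (for example PubMed and Chn) is one simple way to see how far apart their layouts are before training a detector.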
Provided by:
Northeastern University, Adobe Research, Adobe Document Cloud
Created:
2020-03-30

The content above was collected and summarized by 遇见数据集.



