PubLayNet
Source: arXiv. Indexed 2025-09-30.
Download link:
https://github.com/ibm-aur-nlp/publaynet
Description:
This dataset contains 330,000 samples drawn from scientific literature, with each layout component labeled as one of five classes: text, title, figure, list, or table. We use the official data split for the training, validation, and test subsets. The core task is layout generation.
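To make the five-class labeling concrete, here is a minimal sketch of tallying layout components per class. It assumes the annotations are stored as a COCO-style dictionary with `categories`, `images`, and `annotations` keys; the description above does not state the file format, so treat that as an assumption. The helper `count_by_category`, the category ids, and the sample boxes are hypothetical and exist only for illustration.

```python
from collections import Counter

# The five layout classes named in the description.
CATEGORIES = ("text", "title", "figure", "list", "table")


def count_by_category(coco: dict) -> dict:
    """Count annotations per category name in a COCO-style dict (assumed format)."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    # Report every class, including those with no annotations.
    return {name: counts.get(name, 0) for name in CATEGORIES}


# Tiny hand-made sample: hypothetical ids and boxes, not taken from the dataset.
sample = {
    "categories": [{"id": i + 1, "name": n} for i, n in enumerate(CATEGORIES)],
    "images": [{"id": 1, "width": 612, "height": 792}],
    "annotations": [
        {"image_id": 1, "category_id": 1, "bbox": [50, 100, 500, 200]},
        {"image_id": 1, "category_id": 2, "bbox": [50, 60, 300, 30]},
        {"image_id": 1, "category_id": 1, "bbox": [50, 320, 500, 150]},
        {"image_id": 1, "category_id": 4, "bbox": [50, 500, 500, 120]},
    ],
}

print(count_by_category(sample))
# → {'text': 2, 'title': 1, 'figure': 0, 'list': 1, 'table': 0}
```

For a real split, the same function could be applied to a dictionary loaded with `json.load` from whichever annotation file the download provides.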



