Float Structure Analysis Dataset (FSA)

ModelScope community | Updated 2026-01-01 | Indexed 2024-10-26
Download link:
https://modelscope.cn/datasets/irhawks/floating-fsa
Description:
Floats are a common page element in formal publications such as academic papers and books. In LaTeX, a float is a container that can hold text, images, tables, code, algorithms, and similar content, and LaTeX can automatically adjust its position in the document to suit the page layout. To make indexing and reading easier, a float usually carries extra information beyond its main body (the image, table, code block, or algorithm block), such as a type, a number, and a caption. Document Layout Analysis (DLA) tasks detect only a very limited set of elements, and tables, images, and similar items are generally handled separately, which makes fine-grained layout analysis inconvenient. To address this, two tasks are added on top of existing document layout analysis: float position detection and float structure analysis. With reference to the DocGenome dataset, arXiv documents were collected and annotated separately with X-AnyLabel, yielding the Float Detection Dataset (FLD) and the Float Structure Analysis Dataset (FSA), with 600 images each.
Provider:
maas
Created:
2024-10-19
Collected Summary
Dataset Introduction
Background and Challenges
Background Overview
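This dataset focuses on the float structure analysis (FSA) task within document layout analysis. It aims to detect the sub-structures, including captions and content, of floats (such as figures, tables, algorithms, and code) in LaTeX documents. Built from arXiv documents, it contains 600 annotated images and addresses two shortcomings of conventional analysis: the limited range of float types covered and the insufficient recognition of overall relationships.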
The content above was collected and summarized by 遇见数据集.
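
As a rough illustration of the caption/content sub-structure described in the overview, the sketch below reads one annotation and groups its boxes by label. The on-disk format of this dataset is not stated on this page, so everything specific here is an assumption: the LabelMe-style JSON layout (a `shapes` list with `label` and `points`), the label names `content` and `caption`, the file name, and all coordinates are hypothetical and invented for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box for one annotated region."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def parse_shapes(annotation: dict) -> list:
    """Convert LabelMe-style shapes (assumed format) into axis-aligned boxes."""
    boxes = []
    for shape in annotation.get("shapes", []):
        xs = [point[0] for point in shape["points"]]
        ys = [point[1] for point in shape["points"]]
        boxes.append(Box(shape["label"], min(xs), min(ys), max(xs), max(ys)))
    return boxes


def group_by_label(boxes: list) -> dict:
    """Group boxes by their label, e.g. all caption boxes together."""
    grouped = {}
    for box in boxes:
        grouped.setdefault(box.label, []).append(box)
    return grouped


# Hypothetical annotation for one float image; labels and numbers are made up.
EXAMPLE_JSON = """
{
  "imagePath": "example_float.png",
  "shapes": [
    {"label": "content", "shape_type": "rectangle", "points": [[10, 12], [590, 300]]},
    {"label": "caption", "shape_type": "rectangle", "points": [[10, 310], [590, 350]]}
  ]
}
"""

boxes = parse_shapes(json.loads(EXAMPLE_JSON))
parts = group_by_label(boxes)
for label, items in parts.items():
    print(label, [(b.x_min, b.y_min, b.x_max, b.y_max) for b in items])
```

Taking the minimum and maximum over all points keeps the helper usable for both two-corner rectangles and four-point polygons; the actual label set should be checked against the downloaded files before relying on it.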