bioRxiv 10k figure bounding boxes

Indexed from the NIAID Data Ecosystem on 2026-03-13

Download link:

https://zenodo.org/record/5596760


Description:

This dataset contains figure bounding boxes corresponding to the bioRxiv 10k dataset. It provides annotations in two formats: COCO format (JSON), and JATS XML with GROBID's "coords" attribute. The COCO format contains bounding boxes in rendered pixel units as well as in PDF user units; the latter use field names with the "pt_" prefix. The "coords" attribute uses PDF user units.

The dataset was generated by using an algorithm to find the figure images within the rendered PDF pages. The main algorithm used for that purpose is SIFT. As a fallback, OpenCV's Template Matching (with multi-scaling) was used. Some annotations may be incorrect. A small number of documents were excluded where neither algorithm was able to find a match for one of the figure images (six documents in the train subset, two documents in the test subset).

Figure images may appear next to a figure description, but they may also appear as "attachments". Attachments usually appear at the end of the document (but not always), and often on pages whose dimensions differ from the regular page size (but not always).

This dataset itself doesn't contain any images. The PDFs from which to render the pages can be found in the bioRxiv 10k dataset.

The dataset is intended for training or evaluating semantic figure extraction. An evaluation score would be calculated by comparing the extracted bounding boxes with those from this dataset (example implementation: ScienceBeam Judge). The dataset was created as part of eLife's ScienceBeam project.
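The description says that an evaluation compares extracted bounding boxes with the annotated ones, but it does not specify the metric. The following is a minimal sketch of one common way to do that, using intersection over union (IoU) on COCO-style `[x, y, width, height]` boxes. It is not taken from ScienceBeam Judge, whose scoring may differ; the function names `iou` and `match_rate`, the greedy matching, and the `0.5` threshold are all assumptions made for illustration.

```python
from typing import List, Sequence

Box = Sequence[float]  # COCO-style [x, y, width, height]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def match_rate(
    expected: List[Box],
    extracted: List[Box],
    threshold: float = 0.5,  # assumed cut-off, not specified by the dataset
) -> float:
    """Fraction of expected boxes matched by a distinct extracted box.

    Greedy matching: each expected box takes the best remaining
    extracted box, provided their IoU reaches the threshold.
    """
    remaining = list(extracted)
    matched = 0
    for exp in expected:
        best = max(remaining, key=lambda box: iou(exp, box), default=None)
        if best is not None and iou(exp, best) >= threshold:
            matched += 1
            remaining.remove(best)
    return matched / len(expected) if expected else 1.0
```

The same functions work for either unit system (rendered pixels or PDF user units), as long as both boxes being compared come from the same page and use the same units.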

Created:

2021-10-29
