bioRxiv 10k figure bounding boxes

Source: NIAID Data Ecosystem (indexed 2026-03-13)
Download link:
https://zenodo.org/record/5596760
Description:
This dataset contains figure bounding boxes corresponding to the bioRxiv 10k dataset. It provides annotations in two formats:

- COCO format (JSON)
- JATS XML with GROBID's "coords" attribute

The COCO format contains bounding boxes in rendered pixel units as well as in PDF user units. The latter use field names with the "pt_" prefix. The "coords" attribute uses PDF user units.

The dataset was generated by using an algorithm to find the figure images within the rendered PDF pages. The main algorithm used for that purpose is SIFT. As a fallback, OpenCV's Template Matching (with multi-scaling) was used. There may be some error cases in the dataset. A small number of documents were excluded where neither algorithm was able to find a match for one of the figure images (six documents in the train subset, two documents in the test subset).

Figure images may appear next to a figure description, but they may also appear as "attachments". The latter usually appear at the end of the document (but not always), and often on pages with dimensions different from the regular page size (but not always).

This dataset itself doesn't contain any images. The PDFs needed to render the pages can be found in the bioRxiv 10k dataset.

The dataset is intended for training or evaluating semantic figure extraction. The evaluation score would be calculated by comparing the extracted bounding boxes with the ones from this dataset (example implementation: ScienceBeam Judge).

The dataset was created as part of eLife's ScienceBeam project.
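To make the two coordinate systems and the bounding box comparison more concrete, here is a minimal sketch in Python. It is not taken from the dataset or from ScienceBeam Judge. It assumes boxes follow the COCO convention of [x, y, width, height], that the default PDF user unit of 1/72 inch applies, and that the render resolution (the `dpi` argument) is known; the dataset description does not state the resolution used. Intersection over union (IoU) is shown as one common way to compare an extracted box with a reference box, and the function names `px_to_pt` and `iou_xywh` are hypothetical.

```python
# Illustrative sketch only: the helper names and the dpi value are assumptions,
# not part of the dataset or of ScienceBeam Judge.

PDF_POINTS_PER_INCH = 72.0  # default PDF user unit is 1/72 inch


def px_to_pt(bbox_px, dpi):
    """Convert an [x, y, width, height] box from rendered pixels to PDF user units.

    `dpi` is the resolution the page was rendered at (assumed to be known).
    """
    scale = PDF_POINTS_PER_INCH / dpi
    return [value * scale for value in bbox_px]


def iou_xywh(a, b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Overlap along each axis, clamped to zero when the boxes do not intersect
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical values: a reference box in PDF user units and an extracted
    # box in pixels from a page rendered at an assumed 144 dpi.
    reference_pt = [100.0, 100.0, 200.0, 200.0]
    extracted_px = [200.0, 200.0, 400.0, 400.0]
    extracted_pt = px_to_pt(extracted_px, dpi=144)
    print(iou_xywh(reference_pt, extracted_pt))
```

Both boxes must be expressed in the same units and on the same page before they are compared, which is why the pixel box is converted first. A score threshold for counting a box as a match is a separate design choice that this sketch leaves open.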
Created:
2021-10-29