SCICAP

Name: SCICAP
Creator: arXiv
License: No description available

arXiv, indexed 2025-09-30

Download link:

https://github.com/tingyaohsu/scicap


Resource overview:

SCICAP is a large-scale scientific figure captioning dataset built from computer science arXiv papers published between 2010 and 2020. It contains over 2 million figures extracted from more than 290,000 papers. The dataset targets caption generation for individual scientific figures and provides three caption variants: the first sentence of the caption, single-sentence captions, and captions of no more than 100 words.
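The three caption variants described above can be illustrated with a minimal sketch. The function name `caption_variants`, the dictionary keys, and the naive punctuation-based sentence split are all assumptions made for illustration; this is not the dataset authors' preprocessing pipeline, which may segment sentences and count words differently.

```python
import re


def caption_variants(caption: str, max_words: int = 100) -> dict:
    """Derive three caption variants from one raw caption (hypothetical helper)."""
    # Collapse runs of whitespace so word counts are stable.
    text = " ".join(caption.split())
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    # This naive rule mis-splits abbreviations such as "vs." or "Fig.".
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return {
        # First sentence of the caption, or "" for an empty caption.
        "first_sentence": sentences[0] if sentences else "",
        # Full caption only when it consists of exactly one sentence.
        "single_sentence": text if len(sentences) == 1 else None,
        # Full caption only when it is non-empty and within the word limit.
        "max_100_words": text if 0 < len(text.split()) <= max_words else None,
    }
```

A caption that fails a variant's criterion maps to `None` for that variant, so the same caption can belong to one, two, or all three subsets.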

Provider:

arXiv

Dataset introduction

Background overview

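SCICAP is a large-scale scientific figure caption dataset based on computer science arXiv papers. It contains over 416k figures, mainly of the graphplot type. The dataset is intended to help researchers develop computational models that automatically analyze and caption scientific figures, and it provides extensive figure and caption data together with detailed data processing and annotation information.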

The above content was collected and summarized by 遇见数据集.
