Download link:

https://modelscope.cn/datasets/Skywork/CSVQA
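
A minimal sketch of how the link above might be used programmatically. It assumes the third-party `modelscope` package is installed and that its `MsDataset.load` accepts the `namespace/name` identifier taken from the URL path. The helper names `dataset_id_from_url` and `load_csvqa` are hypothetical, and the subset and split arguments are left at their defaults because this page does not document them.

```python
from urllib.parse import urlparse

DATASET_URL = "https://modelscope.cn/datasets/Skywork/CSVQA"


def dataset_id_from_url(url: str) -> str:
    """Extract the 'namespace/name' identifier from a ModelScope dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a ModelScope dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_csvqa(url: str = DATASET_URL):
    """Load the dataset through the ModelScope SDK (requires network access)."""
    # Imported lazily so the URL helper also works without the SDK installed.
    from modelscope.msdatasets import MsDataset

    # Assumption: the default subset and split are acceptable for this dataset.
    return MsDataset.load(dataset_id_from_url(url))


if __name__ == "__main__":
    print(dataset_id_from_url(DATASET_URL))  # prints "Skywork/CSVQA"
```

Calling `load_csvqa()` triggers the actual download; the identifier helper can be used on its own to confirm which dataset a link points to.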

Resource overview:

<div align="center">
<img src="skywork-logo.png" alt="Introduction Image" width="500" height="400">
</div>

# CSVQA (Chinese Science Visual Question Answering)

| 🏆 [Leaderboard](#mini-leaderboard) | 📄 [arXiv](https://arxiv.org/abs/2505.24120) | 💻 [GitHub](https://github.com/SkyworkAI/CSVQA) | 🌐 [Webpage](https://csvqa-benchmark.github.io/) | 📄 [Paper](https://huggingface.co/papers/2505.24120) |

## 🔥News

**June 2, 2025**: Our paper is now available on arXiv, and we welcome citations: [CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs](https://arxiv.org/abs/2505.24120)

**May 30, 2025**: We developed a complete evaluation pipeline; the implementation details are available on [GitHub](https://github.com/csvqa-benchmark/CSVQA/tree/main)

## 📖 Dataset Introduction

Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal understanding, yet their capabilities for scientific reasoning remain inadequately assessed. Current multimodal benchmarks predominantly evaluate **generic image comprehension** or **text-driven reasoning**, lacking **authentic scientific contexts** that require domain-specific knowledge integration with visual evidence analysis.

To fill this gap, we present **CSVQA**, a diagnostic multimodal benchmark specifically designed for evaluating scientific reasoning through domain-grounded visual question answering. Our benchmark features **1,378 carefully constructed question-answer pairs** spanning diverse **STEM disciplines**, each demanding **domain knowledge**, **integration of visual evidence**, and **higher-order reasoning**.

Compared to prior multimodal benchmarks, CSVQA places greater emphasis on real-world scientific content and complex reasoning.

## 🧠 Dataset Description

![Challenges in CSVQA](https://raw.githubusercontent.com/csvqa-benchmark/CSVQA/main/images/challenges.png)

CSVQA introduces three key challenges that distinguish it from most existing benchmarks:

1. **Multidisciplinary Coverage**: It spans multiple STEM disciplines, requiring diverse domain knowledge and reasoning strategies.
2. **Visual Modality Diversity**: It includes 14 distinct visual modalities, testing a model’s ability to generalize across varied image structures and complexities.
3. **Real-world Contextualization**: Many questions are grounded in real-world STEM scenarios and demand domain-specific knowledge, requiring models to go beyond pattern recognition.

An overview of dataset composition is presented below:

<h3 align="center"></h3>
<div align='center'>
<table style="margin: auto;">
<tbody>
<tr style="background-color: #f9f9f9;"> <td style="padding: 10px 15px; border: 1px solid #ddd;">Total Questions</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">1378</td> </tr>
<tr> <td style="padding: 10px 15px; border: 1px solid #ddd;">Image Types</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">14</td> </tr>
<tr style="background-color: #f9f9f9;"> <td style="padding: 10px 15px; border: 1px solid #ddd;">Easy : Medium : Hard</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">22.6% : 67.4% : 10.0%</td> </tr>
<tr> <td style="padding: 10px 15px; border: 1px solid #ddd;">Multiple-choice Questions</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">1278</td> </tr>
<tr style="background-color: #f9f9f9;"> <td style="padding: 10px 15px; border: 1px solid #ddd;">Open Questions</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">100</td> </tr>
<tr> <td style="padding: 10px 15px; border: 1px solid #ddd;">With Explanation</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">81.1%</td> </tr>
<tr style="background-color: #f9f9f9;"> <td style="padding: 10px 15px; border: 1px solid #ddd;">Image in Question</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">1341</td> </tr>
<tr> <td style="padding: 10px 15px; border: 1px solid #ddd;">Image in Option</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">37</td> </tr>
<tr style="background-color: #f9f9f9;"> <td style="padding: 10px 15px; border: 1px solid #ddd;">Average Question Length</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">69.7</td> </tr>
<tr> <td style="padding: 10px 15px; border: 1px solid #ddd;">Average Option Length</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">12.1</td> </tr>
<tr style="background-color: #f9f9f9;"> <td style="padding: 10px 15px; border: 1px solid #ddd;">Average Explanation Length</td> <td style="padding: 10px 15px; border: 1px solid #ddd; text-align: right;">123.5</td> </tr>
</tbody>
</table>
</div>
<h3 align="center"></h3>
<p style="font-style: italic; font-size: 0.9em; color: #666; margin-top: 8px;"> Note: Length analysis is conducted in English for cross-dataset comparison. </p>

<a name="mini-leaderboard"></a>

## 🏆 Mini-Leaderboard

The top two performers in each column are shown in bold.

<!DOCTYPE html>
<html lang="en">
<div align='center'>
<table style="margin: auto;">
<thead>
<tr> <th><b>Model</b></th> <th><b>Overall</b></th> <th><b>Biology</b></th> <th><b>Chemistry</b></th> <th><b>Math</b></th> <th><b>Physics</b></th> <th><b>Open</b></th> <th><b>MC</b></th> </tr>
</thead>
<tbody>
<tr> <td>Random Choice</td> <td>5.2</td> <td>5.1</td> <td>6.2</td> <td>4.5</td> <td>5.7</td> <td>0</td> <td>5.7</td> </tr>
<tr> <td colspan="8"><b>Open-source VLM</b></td> </tr>
<tr> <td>Fuyu-8B</td> <td>4.9</td> <td>6.3</td> <td>5.6</td> <td>3.5</td> <td>4.3</td> <td>2.0</td> <td>5.1</td> </tr>
<tr> <td>Deepseek-VL2</td> <td>6.2</td> <td>7.0</td> <td>6.2</td> <td>7.6</td> <td>4.5</td> <td>8.0</td> <td>6.0</td> </tr>
<tr> <td>LLaVA1.5-13B</td> <td>7.5</td> <td>10.7</td> <td>9.4</td> <td>5.4</td> <td>5.5</td> <td>4.0</td> <td>7.8</td> </tr>
<tr> <td>MonoInternVL</td> <td>9.3</td> <td>7.3</td> <td>9.1</td> <td>9.2</td> <td>10.9</td> <td>3.0</td> <td>9.8</td> </tr>
<tr> <td>Idefics3-8B</td> <td>10.1</td> <td>11.7</td> <td>15.2</td> <td>7.0</td> <td>7.1</td> <td>4.0</td> <td>10.6</td> </tr>
<tr> <td>Pixtral-12B</td> <td>10.5</td> <td>15.3</td> <td>8.8</td> <td>8.6</td> <td>10.0</td> <td>5.0</td> <td>10.9</td> </tr>
<tr> <td>Phi-4</td> <td>11.5</td> <td>13.3</td> <td>16.1</td> <td>8.9</td> <td>8.3</td> <td>7.0</td> <td>11.8</td> </tr>
<tr> <td>Gemma3-27B</td> <td>22.9</td> <td>26.0</td> <td>23.5</td> <td>27.0</td> <td>17.1</td> <td>23.0</td> <td>22.9</td> </tr>
<tr> <td>InternVL2.5-78B</td> <td>28.4</td> <td>36.3</td> <td>36.1</td> <td>24.1</td> <td>19.7</td> <td>16.0</td> <td>29.3</td> </tr>
<tr> <td>QVQ-72B</td> <td>36.6</td> <td>40.7</td> <td>41.3</td> <td>33.7</td> <td>32.0</td> <td>32.0</td> <td>36.9</td> </tr>
<tr> <td>InternVL3-78B</td> <td>37.4</td> <td><b>46.0</b></td> <td>41.1</td> <td>36.5</td> <td>28.9</td> <td>30.0</td> <td>38.0</td> </tr>
<tr> <td>Qwen2.5-VL-72B</td> <td>38.5</td> <td>45.7</td> <td>40.8</td> <td>37.5</td> <td>32.2</td> <td>29.0</td> <td>39.2</td> </tr>
<tr> <td colspan="8"><b>Closed-source VLM</b></td> </tr>
<tr> <td>GPT-4o</td> <td>23.6</td> <td>28.0</td> <td>23.5</td> <td>23.5</td> <td>20.6</td> <td>18.0</td> <td>24.0</td> </tr>
<tr> <td>Claude3.7</td> <td>36.6</td> <td>41.7</td> <td>38.1</td> <td>37.1</td> <td>31.3</td> <td>32.0</td> <td>36.9</td> </tr>
<tr> <td>Gemini2.0-flash</td> <td><b>44.1</b></td> <td>45.0</td> <td><b>45.5</b></td> <td><b>47.6</b></td> <td><b>39.8</b></td> <td><b>46.0</b></td> <td><b>44.0</b></td> </tr>
<tr> <td>o1</td> <td><b>49.6</b></td> <td><b>46.2</b></td> <td><b>45.1</b></td> <td><b>59.0</b></td> <td><b>49.1</b></td> <td><b>41.3</b></td> <td><b>50.2</b></td> </tr>
</tbody>
</table>
</div>
</html>

## ⚠️ Dataset Limitations

Although CSVQA covers a broad range of disciplines and diverse question types, there are still some limitations:

- **Subject Coverage**: Currently, it only includes high school science content; future versions may extend to undergraduate-level science and engineering.
- **Data Distribution Analysis**: We are still analyzing the detailed distribution across subjects and question types to ensure balanced coverage.
- **Annotation Noise**: Despite strict quality control, there may be occasional OCR recognition errors or incomplete parsing.

---

## 📩 Contact

If you have any questions or suggestions regarding the dataset, please contact:

- [shawn0wang76031@gmail.com](mailto:shawn0wang76031@gmail.com)
- [jianai@bupt.edu.cn](mailto:jianai@bupt.edu.cn)

🚀 We welcome your feedback and contributions of data and benchmark results! 🎉

## Citation

If you use this work in your research, please cite:

```
@misc{jian2025csvqachinesemultimodalbenchmark,
      title={CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning Capabilities of VLMs},
      author={Ai Jian and Weijie Qiu and Xiaokun Wang and Peiyu Wang and Yunzhuo Hao and Jiangbo Pei and Yichen Wei and Yi Peng and Xuchen Song},
      year={2025},
      eprint={2505.24120},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.24120},
}
```
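
The figures in the composition table and the mini-leaderboard can be cross-checked with a little arithmetic. The sketch below is only a sanity check under one assumption: this README does not define how the Overall column is computed, so the code assumes it is the question-count-weighted mean of the Open and MC columns. The function name `weighted_overall` and the 0.1 tolerance (to absorb rounding in the published numbers) are illustrative choices, not part of the benchmark.

```python
# Counts copied from the dataset composition table.
TOTAL = 1378
MULTIPLE_CHOICE = 1278
OPEN = 100
IMAGE_IN_QUESTION = 1341
IMAGE_IN_OPTION = 37

# (Open, MC, reported Overall) copied from the mini-leaderboard.
ROWS = {
    "Qwen2.5-VL-72B": (29.0, 39.2, 38.5),
    "InternVL3-78B": (30.0, 38.0, 37.4),
    "Gemini2.0-flash": (46.0, 44.0, 44.1),
    "o1": (41.3, 50.2, 49.6),
}


def weighted_overall(open_acc: float, mc_acc: float) -> float:
    """Question-count-weighted mean of Open and MC accuracy (assumed definition)."""
    return (open_acc * OPEN + mc_acc * MULTIPLE_CHOICE) / TOTAL


def check(tolerance: float = 0.1) -> bool:
    """Return True if the table counts add up and Overall matches the assumption."""
    ok = MULTIPLE_CHOICE + OPEN == TOTAL
    ok = ok and IMAGE_IN_QUESTION + IMAGE_IN_OPTION == TOTAL
    for model, (open_acc, mc_acc, reported) in ROWS.items():
        estimate = weighted_overall(open_acc, mc_acc)
        print(f"{model}: estimated {estimate:.2f}, reported {reported}")
        ok = ok and abs(estimate - reported) < tolerance
    return ok


if __name__ == "__main__":
    print("consistent:", check())
```

For the four rows included here the estimate stays within 0.1 of the reported Overall score, which is consistent with the assumed weighting but does not confirm it.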

Application scenarios: