
harshsingh-mathongo/SeePhys

Hugging Face | Updated 2026-04-17 | Indexed 2026-04-26
Download link:
https://hf-mirror.com/datasets/harshsingh-mathongo/SeePhys
Description:
---
license: apache-2.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
- config_name: dev
  data_files:
  - split: total
    path: dev/total-*
  - split: dev
    path: dev/dev-*
dataset_info:
- config_name: default
  features:
  - name: index
    dtype: int64
  - name: question
    dtype: string
  - name: answer
    dtype: string
  - name: images
    list: image
  - name: reasoning
    dtype: string
  - name: sig_figs
    dtype: string
  - name: level
    dtype: int64
  - name: subject
    dtype: string
  - name: language
    dtype: string
  - name: img_category
    dtype: string
  - name: vision_relevance
    dtype: string
  - name: caption
    dtype: string
  splits:
  - name: train
    num_bytes: 90258908.0
    num_examples: 2000
  download_size: 77879212
  dataset_size: 90258908.0
- config_name: dev
  features:
  - name: question
    dtype: string
  - name: subject
    dtype: string
  - name: image_path
    sequence: string
  - name: sig_figs
    dtype: string
  - name: level
    dtype: int64
  - name: language
    dtype: string
  - name: index
    dtype: int64
  - name: img_category
    dtype: string
  - name: vision_relevance
    dtype: string
  - name: caption
    dtype: string
  - name: image_0
    dtype: image
  - name: image_1
    dtype: image
  - name: image_2
    dtype: image
  - name: image_3
    dtype: image
  splits:
  - name: total
    num_bytes: 96133884.0
    num_examples: 2000
  - name: dev
    num_bytes: 9343791.0
    num_examples: 200
  download_size: 86916417
  dataset_size: 105477675.0
task_categories:
- question-answering
- visual-question-answering
language:
- en
tags:
- physics
- multi-modal
size_categories:
- 1K<n<10K
---

# SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Can AI truly see the Physics? Test your model with the newly released SeePhys Benchmark!

Covering 2,000 vision-text multimodal physics problems spanning from middle school to doctoral qualification exams, the SeePhys benchmark systematically evaluates LLMs/MLLMs on tasks integrating complex scientific diagrams with theoretical derivations. Experiments reveal that even SOTA models like Gemini-2.5-Pro and o4-mini achieve accuracy rates below 55%, with over 30% error rates on simple middle-school-level problems, highlighting significant challenges in multimodal reasoning.

The benchmark is now open for evaluation at the ICML 2025 AI for MATH Workshop. Academic and industrial teams are invited to test their models!

🔗 Key Links:

- 📜 Paper: http://arxiv.org/abs/2505.19099
- ⚛️ Project Page: https://seephys.github.io/
- 🏆 Challenge Submission: https://www.codabench.org/competitions/7925/
- ➡️ Competition Guidelines: https://sites.google.com/view/ai4mathworkshopicml2025/challenge

The answer will be announced on July 1st, 2025 (Anywhere on Earth, AoE), which is after the submission deadline for the ICML 2025 Challenges on Automated Math Reasoning and Extensions.

If you find SeePhys useful for your research and applications, please kindly cite using this BibTeX:

```
@article{xiang2025seephys,
  title={SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning},
  author={Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang},
  journal={arXiv preprint arXiv:2505.19099},
  year={2025}
}
```
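The card's metadata declares two configs that store images differently: `default` keeps them in a single `images` list, while `dev` spreads them over the columns `image_0` to `image_3`. The sketch below shows one possible way to read both layouts through a single helper. The helper name `collect_images`, the sample rows, and the assumption that unused `image_N` slots are `None` are illustrative and not confirmed by the card.

```python
from typing import Any, Dict, List

# Column names taken from the `dataset_info` section of the dataset card.
DEV_IMAGE_FIELDS = ("image_0", "image_1", "image_2", "image_3")


def collect_images(record: Dict[str, Any]) -> List[Any]:
    """Return the images attached to a record from either config.

    `default` records carry an `images` list. `dev` records carry up to
    four `image_N` columns; empty slots are ASSUMED to be None.
    """
    if "images" in record:
        return list(record["images"] or [])
    return [record[f] for f in DEV_IMAGE_FIELDS if record.get(f) is not None]


# Hypothetical rows standing in for real data; strings replace image objects.
default_row = {"index": 1, "question": "q", "images": ["img_a", "img_b"]}
dev_row = {"index": 2, "question": "q", "image_0": "img_c",
           "image_1": None, "image_2": None, "image_3": None}

print(collect_images(default_row))
print(collect_images(dev_row))
```

With the Hugging Face `datasets` library, real rows could presumably be obtained with a call such as `load_dataset("harshsingh-mathongo/SeePhys", "dev", split="dev")`. That call requires network access and is not exercised in this sketch.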
Provider:
harshsingh-mathongo