FlagEval/MeasureBench

Name: FlagEval/MeasureBench
Creator: FlagEval
Published: 2025-11-03 02:08:53
License: CC BY-SA 4.0

Hugging Face · Updated 2025-11-03 · Indexed 2026-01-03

Download link:

https://hf-mirror.com/datasets/FlagEval/MeasureBench

Description:

---
dataset_info:
  features:
  - name: question_id
    dtype: string
  - name: question
    dtype: string
  - name: image
    dtype: image
  - name: image_type
    dtype: string
  - name: design
    dtype: string
  - name: evaluator
    dtype: string
  - name: evaluator_kwargs
    dtype: string
  - name: meta_info
    struct:
    - name: source
      dtype: string
    - name: uploader
      dtype: string
    - name: split
      dtype: string
  splits:
  - name: real_world
    num_bytes: 101881211.28
    num_examples: 1272
  - name: synthetic_test
    num_bytes: 84545022.06
    num_examples: 1170
  download_size: 182712804
  dataset_size: 186426233.34
configs:
- config_name: default
  data_files:
  - split: real_world
    path: data/real_world-*
  - split: synthetic_test
    path: data/synthetic_test-*
license: cc-by-sa-4.0
task_categories:
- image-text-to-text
language:
- en
pretty_name: MeasureBench
size_categories:
- 1K<n<10K
---

# Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

🏠[Project Page](https://flageval-baai.github.io/MeasureBenchPage/) | 💻[Code](https://github.com/flageval-baai/MeasureBench) | 📖[Paper](https://arxiv.org/abs/2510.26865/) | 🤗[Data](https://huggingface.co/datasets/FlagEval/MeasureBench)

Fine-grained visual understanding tasks such as visual measurement reading have proven surprisingly challenging for frontier general-purpose vision-language models (VLMs). We introduce MeasureBench, a benchmark of diverse images of measuring instruments, drawn from both real-world photographs and a new data synthesis pipeline.

![MeasureBench overview](src/intro.jpg)

MeasureBench comprises 2442 image–question pairs: 1272 diverse real-world images that were collected and human-annotated, and 1170 synthetic images generated with randomized readings for 39 instruments.

## Evaluation Findings

- **Persisting difficulty.** Current VLMs still struggle with instrument reading: the best model achieves only 30.3% accuracy on the real-world set and 26.1% on the synthetic set.
- **Object recognition and text reading seem easy, but inferring numbers is hard.** Models exhibit strong image understanding and text recognition (e.g., reading units), reaching over 90% accuracy on unit identification. Yet they falter when mapping scales to numeric values.
- **Systematic fine-grained errors.** Models often "know how to read" but miss details: they misinterpret pointer positions, confuse adjacent ticks, and mismatch values to scale markings, leading to near-miss but incorrect answers.

## Licensing Information

MeasureBench is licensed under the [CC BY-SA 4.0 License](https://creativecommons.org/licenses/by-sa/4.0/).

## 🥺 Citation Information

```bibtex
@misc{lin2025measurebench,
  title={Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench},
  author={Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Hu Chen, Jiakang Liu, JG Yao, Xi Yang},
  year={2025},
  eprint={2510.26865},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
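For readers who want to work with the splits and features listed in the card metadata, the following is a minimal sketch rather than an official loader. It assumes the Hugging Face `datasets` package is installed and that the repository is reachable. The helper names `load_split` and `parse_evaluator_kwargs` are hypothetical, and treating `evaluator_kwargs` as a JSON-encoded object is an assumption: the card only states that the field's dtype is `string`.

```python
import json
from typing import Any

# Split names and example counts as listed in the dataset card metadata.
SPLIT_SIZES = {"real_world": 1272, "synthetic_test": 1170}


def load_split(split: str):
    """Load one MeasureBench split from the Hugging Face Hub (needs network access)."""
    if split not in SPLIT_SIZES:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(SPLIT_SIZES)}"
        )
    # Imported lazily so the helper below works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset("FlagEval/MeasureBench", split=split)


def parse_evaluator_kwargs(raw: str) -> dict[str, Any]:
    """Decode the string-typed `evaluator_kwargs` field.

    Assumption: the string holds a JSON object. If decoding fails, or the
    decoded value is not an object, return an empty dict instead of raising.
    """
    try:
        value = json.loads(raw)
    except (TypeError, ValueError):
        return {}
    return value if isinstance(value, dict) else {}
```

Calling `load_split("real_world")` should return the 1272-example real-world split, and `parse_evaluator_kwargs(example["evaluator_kwargs"])` can then be applied per example. If the field turns out to use a different encoding, only the parsing helper needs to change.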

Provided by:

FlagEval
