GMAI-MMBench | Medical AI Dataset | Multimodal Evaluation Dataset

ModelScope community | Updated 2025-07-03 | Indexed 2025-01-04

Medical AI

Multimodal evaluation

Download link:

https://modelscope.cn/datasets/OpenGVLab/GMAI-MMBench


Resource overview:

# GMAI-MMBench

[🍎 **Homepage**](https://uni-medical.github.io/GMAI-MMBench.github.io/#2023xtuner) | [**🤗 Dataset**](https://huggingface.co/datasets/myuniverse/GMAI-MMBench) | [**🤗 Paper**](https://huggingface.co/papers/2408.03361) | [**📖 arXiv**](https://arxiv.org/abs/2408.03361) | [**🐙 GitHub**](https://github.com/uni-medical/GMAI-MMBench) | [**🌐 OpenDataLab**](https://opendatalab.com/GMAI/MMBench)

This repository is the official implementation of the paper **GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI**.

## 🌈 Update

- **🚀[2024-09-26]: Accepted by NeurIPS 2024 Datasets and Benchmarks Track!🌟**

## 🚗 Tutorial

This project is built upon **VLMEvalKit**. To get started:

1. Visit the [VLMEvalKit Quickstart Guide](https://github.com/open-compass/VLMEvalKit/blob/main/docs/en/get_started/Quickstart.md) for installation instructions. You can use the following commands for installation:

   ```bash
   git clone https://github.com/open-compass/VLMEvalKit.git
   cd VLMEvalKit
   pip install -e .
   ```

2. **VAL data evaluation**: You can run the evaluation using either `python` or `torchrun`. Here are some examples:

   ```bash
   # When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
   # That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).

   # IDEFICS-80B-Instruct on GMAI-MMBench_VAL, Inference and Evaluation
   python run.py --data GMAI-MMBench_VAL --model idefics_80b_instruct --verbose
   # IDEFICS-80B-Instruct on GMAI-MMBench_VAL, Inference only
   python run.py --data GMAI-MMBench_VAL --model idefics_80b_instruct --verbose --mode infer

   # When running with `torchrun`, one VLM instance is instantiated on each GPU. It can speed up the inference.
   # However, that is only suitable for VLMs that consume small amounts of GPU memory.

   # IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2 on GMAI-MMBench_VAL. On a node with 8 GPUs. Inference and Evaluation.
   torchrun --nproc-per-node=8 run.py --data GMAI-MMBench_VAL --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
   # Qwen-VL-Chat on GMAI-MMBench_VAL. On a node with 2 GPUs. Inference and Evaluation.
   torchrun --nproc-per-node=2 run.py --data GMAI-MMBench_VAL --model qwen_chat --verbose
   ```

   The evaluation results will be printed as logs. In addition, **result files** will be generated in the directory `$YOUR_WORKING_DIRECTORY/{model_name}`. Files ending with `.csv` contain the evaluated metrics.

   **TEST data evaluation**

   ```bash
   # When running with `python`, only one VLM instance is instantiated, and it might use multiple GPUs (depending on its default behavior).
   # That is recommended for evaluating very large VLMs (like IDEFICS-80B-Instruct).

   # IDEFICS-80B-Instruct on GMAI-MMBench_TEST, Inference and Evaluation
   python run.py --data GMAI-MMBench_TEST --model idefics_80b_instruct --verbose
   # IDEFICS-80B-Instruct on GMAI-MMBench_TEST, Inference only
   python run.py --data GMAI-MMBench_TEST --model idefics_80b_instruct --verbose --mode infer

   # When running with `torchrun`, one VLM instance is instantiated on each GPU. It can speed up the inference.
   # However, that is only suitable for VLMs that consume small amounts of GPU memory.

   # IDEFICS-9B-Instruct, Qwen-VL-Chat, mPLUG-Owl2 on GMAI-MMBench_TEST. On a node with 8 GPUs. Inference and Evaluation.
   torchrun --nproc-per-node=8 run.py --data GMAI-MMBench_TEST --model idefics_9b_instruct qwen_chat mPLUG-Owl2 --verbose
   # Qwen-VL-Chat on GMAI-MMBench_TEST. On a node with 2 GPUs. Inference and Evaluation.
   torchrun --nproc-per-node=2 run.py --data GMAI-MMBench_TEST --model qwen_chat --verbose
   ```

   Because the test data does not include answers, an error will occur after running. This error indicates that VLMEvalKit cannot retrieve the answer during the final result matching stage.

   ![image1](image1.png)

   You can access the generated intermediate results from VLMEvalKit/outputs/\.
   This is the content of the intermediate result Excel file, where the model's predictions are listed under "prediction."

   ![image2](image2.png)

   You will then need to send this Excel file via email to guoanwang971@gmail.com. The email must include the following information: \, \, and \. We will calculate the accuracy of your model using the answer key and periodically update the leaderboard.

3. You can find more details at https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/dataset/image_mcq.py.

## To render an image into visualization

To make it easy to test the benchmark with VLMEvalKit, we store our data directly in TSV format, so the benchmark can be used with this tool without any additional operations. The VAL data includes an "answer" column; to prevent data leakage, the "answer" column has been removed from the TEST data. The "image" column is Base64-encoded (to comply with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)'s requirements).
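To make the TSV layout described above concrete, here is a minimal sketch that parses TSV text, checks whether a split carries answers, and decodes the Base64 "image" cell back to raw bytes. Only the "answer" and "image" column names come from the description above; the `index` and `question` columns, the helper names, and the synthetic rows are assumptions made for illustration, so the real files may differ.

```python
import base64
import csv
import io

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_tsv_rows(tsv_text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    # Base64 image cells can be very long, so raise the csv field size limit.
    csv.field_size_limit(2**31 - 1)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def has_answers(rows):
    """Return True if every row has a non-empty "answer" value (VAL-style split)."""
    return bool(rows) and all(row.get("answer") for row in rows)


def decode_image_bytes(row):
    """Decode the Base64 "image" cell back into raw image bytes."""
    return base64.b64decode(row["image"])


# Synthetic stand-ins for the real files (assumption: a real row holds a full image).
fake_image = base64.b64encode(PNG_SIGNATURE + b"placeholder").decode()
val_tsv = (
    "index\tquestion\tanswer\timage\n"
    "0\tWhich modality is shown?\tA\t" + fake_image + "\n"
)
test_tsv = (
    "index\tquestion\timage\n"
    "0\tWhich modality is shown?\t" + fake_image + "\n"
)

val_rows = read_tsv_rows(val_tsv)
test_rows = read_tsv_rows(test_tsv)
print(has_answers(val_rows), has_answers(test_rows))
print(decode_image_bytes(val_rows[0]).startswith(PNG_SIGNATURE))
```

For the actual benchmark files, the same helpers could be applied to the text read from the downloaded TSV; the decoded bytes can then be opened with an image library.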
The code used to Base64-encode the images is as follows:

```python
import base64

import cv2


def encode_image_to_base64(image):
    """Convert an image array to a Base64-encoded PNG string."""
    _, buffer = cv2.imencode('.png', image)
    return base64.b64encode(buffer).decode()


# image_path is the path to the source image file.
image = cv2.imread(image_path, cv2.IMREAD_COLOR)
encoded_image = encode_image_to_base64(image)
```

The code for converting the Base64 format back into an image can be referenced from the official [VLMEvalKit](https://github.com/open-compass/VLMEvalKit):

```python
import base64
import io

from PIL import Image


def decode_base64_to_image(base64_string, target_size=-1):
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    if image.mode in ('RGBA', 'P'):
        image = image.convert('RGB')
    if target_size > 0:
        image.thumbnail((target_size, target_size))
    return image
```

If needed, below is the official code provided by [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) for converting an image to Base64 encoding:

```python
import base64
import io

from PIL import Image


def encode_image_to_base64(img, target_size=-1):
    # if target_size == -1, will not do resizing
    # else, will set the max_size to (target_size, target_size)
    if img.mode in ('RGBA', 'P'):
        img = img.convert('RGB')
    if target_size > 0:
        img.thumbnail((target_size, target_size))
    img_buffer = io.BytesIO()
    img.save(img_buffer, format='JPEG')
    image_data = img_buffer.getvalue()
    ret = base64.b64encode(image_data).decode('utf-8')
    return ret


def encode_image_file_to_base64(image_path, target_size=-1):
    image = Image.open(image_path)
    return encode_image_to_base64(image, target_size=target_size)
```

## Benchmark Details

We introduce GMAI-MMBench: the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from **284 datasets** across **38 medical image modalities**, **18 clinical-related tasks**, **18 departments**, and **4 perceptual granularities** in a Visual Question Answering (VQA) format.
Additionally, we implemented a **lexical tree** structure that allows users to customize evaluation tasks, accommodating various assessment needs and substantially supporting medical AI research and applications. We evaluated 50 LVLMs, and the results show that even the advanced GPT-4o only achieves an accuracy of about 53% (53.53 on Val and 53.96 on Test in the leaderboard below), indicating significant room for improvement. We believe GMAI-MMBench will stimulate the community to build the next generation of LVLMs toward GMAI.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64324ceff76c34519e97c645/ZzryetCcAb43x88xqtOUO.png)

## Benchmark Creation

GMAI-MMBench is constructed from 284 datasets across 38 medical image modalities. These datasets come from public sources (268) and from several hospitals (16) that have agreed to share their ethically approved data. The data collection can be divided into three main steps:

1. We search hundreds of datasets from both public sources and hospitals, then keep 284 datasets with high-quality labels after dataset filtering, unifying image formats, and standardizing label expressions.
2. We categorize all labels into 18 clinical VQA tasks and 18 clinical departments, then export a lexical tree for easily customized evaluation.
3. We generate QA pairs for each label from its corresponding question and option pool. Each question must include information about image modality, task cue, and corresponding annotation granularity.

The final benchmark is obtained through additional validation and manual selection.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64324ceff76c34519e97c645/PndRciL1221KdTHkXmGsK.png)

## Lexical Tree

In this work, to make GMAI-MMBench more intuitive and user-friendly, we have systematized our labels and structured the entire dataset into a lexical tree. Users can freely select the test contents based on this lexical tree. We believe that this customizable benchmark will effectively guide the improvement of models in specific areas.
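One way to picture selecting test contents from the lexical tree is as a filter over per-question labels followed by an accuracy computation on the selected subset. The sketch below is only an illustration under stated assumptions: the field names `department`, `modality`, `question`, `answer`, and `prediction`, the helper names, and the sample rows are all hypothetical and may not match the benchmark's actual columns or tooling.

```python
def select_questions(rows, department=None, modality=None, keywords=()):
    """Keep rows matching the chosen department, modality, and every keyword."""
    selected = []
    for row in rows:
        if department and row["department"].lower() != department.lower():
            continue
        if modality and row["modality"].lower() != modality.lower():
            continue
        question = row["question"].lower()
        if not all(keyword.lower() in question for keyword in keywords):
            continue
        selected.append(row)
    return selected


def accuracy(rows):
    """Fraction of rows whose prediction equals the answer (0.0 for no rows)."""
    if not rows:
        return 0.0
    correct = sum(row["prediction"] == row["answer"] for row in rows)
    return correct / len(rows)


# Hypothetical rows; field names and values are assumptions for this sketch.
rows = [
    {"department": "Ophthalmology", "modality": "Fundus Photography",
     "question": "Which disease is shown in this fundus image?",
     "answer": "A", "prediction": "A"},
    {"department": "Ophthalmology", "modality": "Fundus Photography",
     "question": "What is the severity grade of the disease?",
     "answer": "B", "prediction": "C"},
    {"department": "Ophthalmology", "modality": "OCT",
     "question": "Which disease is shown in this scan?",
     "answer": "C", "prediction": "C"},
    {"department": "Pulmonology", "modality": "X-ray",
     "question": "Which disease is shown in this chest image?",
     "answer": "D", "prediction": "D"},
]

subset = select_questions(
    rows,
    department="ophthalmology",
    modality="fundus photography",
    keywords=("disease",),
)
print(len(subset), accuracy(subset))
```

Keeping selection and scoring as separate functions means the same subset can be scored for several models by swapping only the `prediction` values.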
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64324ceff76c34519e97c645/TxpmG_zY0JiALptSw42Pf.png)

You can see the complete lexical tree at [**🍎 Homepage**](https://uni-medical.github.io/GMAI-MMBench.github.io/#2023xtuner).

## Evaluation

The figure below shows an example of how to use the lexical tree to customize evaluations. The process involves selecting the department (ophthalmology), choosing the modality (fundus photography), filtering questions using relevant keywords, and evaluating different models based on their accuracy in answering the filtered questions.

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/64324ceff76c34519e97c645/o7ga5ZBIiTs0owhQP4Hoi.jpeg)

## 🏆 Leaderboard

| Rank | Model Name | Val | Test |
|:----:|:-------------------------:|:-----:|:-----:|
| | Random | 25.70 | 25.94 |
| 1 | GPT-4o | 53.53 | 53.96 |
| 2 | Gemini 1.5 | 47.42 | 48.36 |
| 3 | Gemini 1.0 | 44.38 | 44.93 |
| 4 | GPT-4V | 42.50 | 44.08 |
| 5 | MedDr | 41.95 | 43.69 |
| 6 | MiniCPM-V2 | 41.79 | 42.54 |
| 7 | DeepSeek-VL-7B | 41.73 | 43.43 |
| 8 | Qwen-VL-Max | 41.34 | 42.16 |
| 9 | LLAVA-InternLM2-7b | 40.07 | 40.45 |
| 10 | InternVL-Chat-V1.5 | 38.86 | 39.73 |
| 11 | TransCore-M | 38.86 | 38.70 |
| 12 | XComposer2 | 38.68 | 39.20 |
| 13 | LLAVA-V1.5-7B | 38.23 | 37.96 |
| 14 | OmniLMM-12B | 37.89 | 39.30 |
| 15 | Emu2-Chat | 36.50 | 37.59 |
| 16 | mPLUG-Owl2 | 35.62 | 36.21 |
| 17 | CogVLM-Chat | 35.23 | 36.08 |
| 18 | Qwen-VL-Chat | 35.07 | 36.96 |
| 19 | Yi-VL-6B | 34.82 | 34.31 |
| 20 | Claude3-Opus | 32.37 | 32.44 |
| 21 | MMAlaya | 32.19 | 32.30 |
| 22 | Mini-Gemini-7B | 32.17 | 31.09 |
| 23 | InstructBLIP-7B | 31.80 | 30.95 |
| 24 | Idefics-9B-Instruct | 29.74 | 31.13 |
| 25 | VisualGLM-6B | 29.58 | 30.45 |
| 26 | RadFM | 22.95 | 22.93 |
| 27 | Qilin-Med-VL-Chat | 22.34 | 22.06 |
| 28 | LLaVA-Med | 20.54 | 19.60 |
| 29 | Med-Flamingo | 12.74 | 11.64 |

## Disclaimers

The guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution. Should you encounter any data samples potentially breaching the copyright or licensing regulations of any site, we encourage you to contact us. Upon verification, such samples will be promptly removed.

## Contact

- Jin Ye: jin.ye@monash.edu
- Junjun He: hejunjun@pjlab.org.cn
- Qiao Yu: qiaoyu@pjlab.org.cn

## Citation

**BibTeX:**

```bibtex
@misc{chen2024gmaimmbenchcomprehensivemultimodalevaluation,
  title={GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI},
  author={Pengcheng Chen and Jin Ye and Guoan Wang and Yanjun Li and Zhongying Deng and Wei Li and Tianbin Li and Haodong Duan and Ziyan Huang and Yanzhou Su and Benyou Wang and Shaoting Zhang and Bin Fu and Jianfei Cai and Bohan Zhuang and Eric J Seibel and Junjun He and Yu Qiao},
  year={2024},
  eprint={2408.03361},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2408.03361},
}
```

Provided by:

maas

Created:

2024-12-27
