jefehern/vlmevalkit_inference
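Dataset Card: Dataset Name
Dataset Details
- Dataset source: VLMEvalKit
- Dataset creator: Jefferson Hernandez
Uses
- Reproduce the original results with the default LLM-as-evaluator judge.
- Swap the judge and score the same outputs with a different judge model.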
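As a rough illustration of the second use, the sketch below only prints a VLMEvalKit command that would score a model with a different judge; it does not run anything. It assumes that `run.py` accepts a `--judge` flag and `--mode all`, which should be confirmed with `python run.py --help` for the installed version. The helper `build_eval_cmd` and the judge name `my-judge-model` are hypothetical placeholders, not names from the dataset.

```shell
#!/bin/bash
# Sketch only: build (but do not execute) an evaluation command.
# ASSUMPTION: run.py accepts --judge and --mode all; verify with --help.
# build_eval_cmd is a hypothetical helper, not part of VLMEvalKit.
#   $1: VLMEvalKit model name, $2: judge name, $3: benchmark name
build_eval_cmd() {
    local model="$1" judge="$2" dataset="$3"
    echo "python run.py --data $dataset --model $model --mode all --judge $judge --work-dir vlmeval_results/"
}

# "my-judge-model" is a placeholder; valid judge names depend on VLMEvalKit.
build_eval_cmd "llava_v1.5_7b" "my-judge-model" "MMVet"
```

Printing the command first makes it easy to inspect the model, judge, and benchmark before committing GPU time or judge API calls.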
Dataset Structure
- The dataset was created with VLMEvalKit's default inference command.
- The generated Excel files are the inputs to evaluation.
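To see which prediction files are available before evaluating, something like the following could be used. It is a minimal sketch that assumes the Excel files carry the `.xlsx` extension and sit somewhere under the work directory; `list_predictions` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: list every Excel prediction file under a work directory.
# ASSUMPTION: the files use the .xlsx extension.
# list_predictions is a hypothetical helper, not part of VLMEvalKit.
#   $1: work directory (defaults to vlmeval_results)
list_predictions() {
    local work_dir="${1:-vlmeval_results}"
    find "$work_dir" -type f -name '*.xlsx' | sort
}

# Usage: list_predictions vlmeval_results/
```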
Dataset Creation
Data Collection and Processing
- The dataset was created with the following bash script:
```bash
#!/bin/bash

# Kill all background jobs if the script is interrupted.
trap 'kill $(jobs -p)' SIGINT SIGTERM

# Launch inference for one model in the background.
#   $1: comma-separated GPU ids, $2: VLMEvalKit model name
run_evaluation() {
    echo "Running evaluation for $2"
    CUDA_VISIBLE_DEVICES="$1" python run.py \
        --data MMBench_DEV_EN SEEDBench_IMG MathVista_MINI OCRVQA_TESTCORE \
               ChartQA_TEST LLaVABench RealWorldQA MMStar MMMU_DEV_VAL \
               ScienceQA_VAL HallusionBench TextVQA_VAL AI2D_TEST OCRBench \
               POPE MMVet \
        --model "$2" --mode infer --work-dir vlmeval_results/ &
}

echo "Starting evaluation run"

# Models that require 1 GPU each
run_evaluation "0" "llava_next_vicuna_7b"
run_evaluation "1" "llava_next_mistral_7b"
run_evaluation "2" "idefics2_8b"
run_evaluation "3" "MiniCPM-Llama3-V-2_5"
run_evaluation "4" "llava_v1.5_7b"
run_evaluation "5" "llava_next_llama3_8b"
run_evaluation "6" "Phi-3-Vision"
run_evaluation "7" "paligemma-3b-mix-448"
wait  # Wait for all the processes to complete

run_evaluation "0" "uio2-xxl"
run_evaluation "1" "uio2-xl"
run_evaluation "2" "idefics2_8b_chatty"
wait  # Wait for all the processes to complete

# Models that require 2 GPUs each
run_evaluation "0,1" "cogvlm2-llama3-chat-19B"
run_evaluation "2,3" "llava_next_vicuna_13b"
run_evaluation "4,5" "llava_v1.5_13b"
wait  # Wait for all the processes to complete

# Models that require 4 GPUs each
run_evaluation "0,1,2,3" "InternVL-Chat-V1-5"
run_evaluation "4,5,6,7" "llava_next_yi_34b"
wait  # Wait for all the processes to complete

# Models that require 8 GPUs
run_evaluation "0,1,2,3,4,5,6,7" "llava_next_qwen15_72b"
wait  # Wait for all the processes to complete

echo "Finishing evaluation run"
```
Results using Meta-Llama-3-70B-Instruct as the judge
| Model size | Model name | Image model | Text model | RANK | AVG | AI2D | ChartQA | HallBench | LLaVABench | MMB | MMMU | MMStar | MMVet | MathVista | OCRBench | OCRVQA | POPE | RWQA | SEEDBench | SQA | TextVQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| >30B | InternVL 1.5-26B | InternViT-6B | InternLM2-Chat-20B | 1.5 | 70.7 | 80.41 | 83.76 | 67.09 | 80.50 | 80.93 | 46.11 | 58.60 | 46.33 | 48.20 | 72.20 | 64.36 | 86.92 | 65.62 | 75.90 | 93.13 | 80.50 |
| >30B | LLaVA-NeXT-34B (672) | CLIP-ViT-L/14 336 | Yi-34B | 3.5 | 64.1 | 77.69 | 67.04 | 47.74 | 80.60 | 79.21 | 46.00 | 52.07 | 43.12 | 35.70 | 51.70 | 65.95 | 89.50 | 65.75 | 75.67 | 78.97 | 69.19 |
| >7B | MiniCPM-Llama3-V-2_5 | SigLIP-So400m/14 | LLaMA3-8B | 4.0 | 67.3 | 78.53 | 72.08 | 59.83 | 77.80 | 76.72 | 43.33 | 51.93 | 40.83 | 51.90 | 72.40 | 63.38 | 87.13 | 63.53 | 72.28 | 88.36 | 76.65 |
| >3B | Phi-3-vision-128k-instruct-4.2B | CLIP-ViT-L/14 336 | Phi3 | 5.5 | 65.5 | 78.56 | 81.64 | 58.25 | 80.60 | 73.97 | 46.00 | 47.53 | 35.78 | 44.20 | 64.00 | 61.85 | 83.39 | 58.95 | 70.95 | 90.22 | 72.31 |
| >7B | Idefics2-8B | SigLIP-So400m/14 | Mistral-7B | 5.0 | 58.5 | 72.51 | 26.60 | 57.83 | 82.20 | 76.37 | 48.67 | 49.33 | 32.11 | 44.30 | 63.40 | 3.26 | 86.24 | 60.00 | 72.30 | 86.84 | 73.59 |
| >10B | CogVLM2-Llama3-chat-19B | EVA2-CLIP-E | LLaMA3-8B | 7.0 | 56.7 | 73.61 | 9.68 | 57.31 | 79.30 | 73.88 | 40.56 | 49.00 | 45.87 | 33.30 | 76.20 | 55.01 | 83.13 | 64.58 | 74.29 | 87.46 | 4.80 |
| >50B | LLaVA-NeXT-72B (672) | CLIP-ViT-L/14 336 | Qwen1.5-72B | 8.0 | 59.2 | 75.29 | 44.72 | 50.47 | 81.10 | 78.09 | 45.33 | 43.93 | 39.45 | 41.10 | 39.60 | 66.96 | 83.03 | 57.65 | 71.80 | 79.35 | 49.07 |
| >10B | LLaVA-NeXT-13B (672) | CLIP-ViT-L/14 336 | Vicuna-13B | 8.5 | 59.4 | 72.12 | 60.88 | 47.11 | 81.20 | 70.36 | 38.11 | 41.20 | 39.91 | 29.60 | 50.80 | 64.55 | 87.60 | 58.04 | 71.48 | 70.58 | 66.61 |
| >3B | Paligemma-3b-mix-448 | SigLIP-So400m/14 | Gemma-3B | 10.5 | 58.8 | 69.43 | 33.68 | 52.58 | 83.70 | 69.50 | 32.67 | 48.53 | 29.82 | 28.40 | 61.70 | 57.62 | 87.39 | 54.90 | 69.93 | 93.23 | 67.95 |
| >7B | LLaVA-NeXT-Mistral-7B (672) | CLIP-ViT-L/14 336 | Mistral-7B | 9.5 | 57.7 | 69.11 | 50.92 | 43.74 | 81.50 | 69.42 | 37.78 | 39.07 | 36.70 | 28.30 | 50.70 | 60.94 | 87.46 | 60.26 | 72.14 | 69.77 | 65.32 |
| >7B | Idefics2-Chatty-8B | SigLIP-So400m/14 | Mistral-7B | 8.5 | 52.0 | 70.76 | 13.48 | 53.94 | 81.70 | 74.14 | 42.33 | 45.00 | 31.65 | 40.00 | 59.40 | 3.26 | 79.66 | 55.16 | 68.97 | 85.22 | 27.16 |
| >7B | LLaVA-NeXT-Llama3-8B (672) | CLIP-ViT-L/14 336 | LLaMA3-8B | 10.0 | 55.2 | 70.73 | 40.84 | 48.48 | 81.80 | 72.08 | 39.33 | 43.13 | 31.65 | 30.50 | 37.10 | 58.07 | 80.90 | 55.16 | 70.66 | 74.34 | 48.98 |
| >7B | LLaVA-NeXT-Vicuna-7B (672) | CLIP-ViT-L/14 336 | Vicuna-7B | 11.0 | 56.7 | 66.39 | 54.80 | 45.01 | 81.40 | 67.96 | 32.00 | 38.60 | 32.57 | 29.20 | 49.50 | 63.35 | 87.07 | 57.12 | 69.86 | 68.10 | 63.82 |
| >10B | LLaVA v1.5-13B | CLIP-ViT-L/14 336 | Vicuna-13B | 13.5 | 51.8 | 60.72 | 18.52 | 45.53 | 82.90 | 68.90 | 37.44 | 33.93 | 27.06 | 26.80 | 33.40 | 63.18 | 88.45 | 54.90 | 68.25 | 69.81 | 48.91 |
| >7B | LLaVA v1.5-7B | CLIP-ViT-L/14 336 | Vicuna-7B | 15.0 | 50.0 | 55.51 | 17.76 | 48.16 | 82.60 | 65.38 | 35.00 | 33.27 | 26.15 | 26.00 | 31.60 | 60.58 | 86.17 | 53.86 | 65.92 | 66.57 | 45.39 |
| >7B | Unified-IO 2-XXL | OpenCLIP-ViT-B/16 384 | T5-XXL | 14.0 | 48.1 | 43.85 | 13.48 | 49.32 | 82.10 | 57.39 | 33.33 | 35.80 | 22.02 | 25.80 | 32.10 | 57.39 | 86.21 | 45.36 | 61.61 | 85.50 | 38.89 |
| >3B | Unified-IO 2-XL | OpenCLIP-ViT-B/16 384 | T5-XL | 15.5 | 45.7 | 40.41 | 12.08 | 46.37 | 83.70 | 51.98 | 33.33 | 34.53 | 16.51 | 24.10 | 29.70 | 54.04 | 81.83 | 47.19 | 61.09 | 77.83 | 36.27 |
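
The AVG column appears to be the unweighted mean of the 16 benchmark columns (AI2D through TextVQA), rounded to one decimal place. The card does not state this; it is an inference from the numbers, which match for the InternVL 1.5-26B, MiniCPM-Llama3-V-2_5, and Idefics2-8B rows. A small sketch for recomputing it, where `mean_score` is a hypothetical helper:

```shell
#!/bin/bash
# Sketch only: unweighted mean of the given scores, rounded to one decimal.
# ASSUMPTION: AVG in the table is the plain mean of the 16 benchmark scores.
mean_score() {
    echo "$@" | LC_ALL=C awk '{
        s = 0
        for (i = 1; i <= NF; i++) s += $i
        printf "%.1f\n", s / NF
    }'
}

# The 16 benchmark scores of the InternVL 1.5-26B row, in column order.
mean_score 80.41 83.76 67.09 80.50 80.93 46.11 58.60 46.33 \
           48.20 72.20 64.36 86.92 65.62 75.90 93.13 80.50
# prints 70.7, matching the AVG column for that row
```

Because the mean is unweighted, a single near-zero score (for example 3.26 on OCRVQA for Idefics2-8B) pulls the average down by several points, which is worth keeping in mind when comparing models by AVG alone.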




