
ProBench

ModelScope Community, updated 2025-10-09, indexed 2025-03-22
Download link:
https://modelscope.cn/datasets/AI-Bench/ProBench
Resource overview:
# ProVision

**ProVision** is a benchmark dataset for evaluating state-of-the-art multi-modal language models (MLLMs) across diverse tasks such as science, coding, creative writing, information extraction, perception, knowledge, arts, planning, and mathematics. The dataset aggregates chat instances with associated images, reference answers from gpt-4o, and meta-information (e.g. challenge difficulty, category labels, language, and interconnect decisions) to facilitate comprehensive evaluation on single-round, multi-linguistic, and multi-round tracks.

The benchmark reports performance metrics such as ELO ratings, average output token counts, 95% confidence intervals, and win rates for each task. Results are organized by model family (e.g. proprietary, 70B+ open‑source, 10B+ open‑source, and 7B+ open‑source MLLMs).

---

## Dataset Structure

The dataset is provided with one chat instance per line. Each instance includes:

- **uuid:** Unique identifier.
- **image:** An associated image (loaded as a PIL.Image object).
- **chat_type:** Track type (e.g., `singleround`, `multi-linguistic`, or `multi-round`).
- **conversations:** A list of conversation turns (with reference answers from gpt-4o).
- **challenge:** A dictionary with challenge difficulty information (derived from scores on textual, image, and reasoning aspects).
- **category:** A dictionary with image and question category labels (or a default value).
- **subcategory:** A dictionary with image and question subcategory labels (or a default value).
- **language:** Language code (default is `en`).
- **interconnect:** Indicates whether the conversations are interconnected (e.g., “YES”, “NO”, or “NA”).

---

## Usage

To load the dataset with the Hugging Face Datasets library:

```python
from datasets import load_dataset

# Downloads the dataset from the Hugging Face Hub on first use.
ds = load_dataset("HelloKKMe/ProBench")
```

## Evaluation

Please refer to our [GitHub](?) for evaluation.

## ProVision Leaderboard

### Single-round

| Model | Sci. | Cd. | CW. | IE. | Perc. | Knowl. | Arts | Plan. | Math. | Mt. | #Token | 95% CI | WR | Elo |
|:------------------------------|-----:|-----:|-----:|-----:|------:|-------:|-----:|------:|------:|-----:|-------:|:----------|------:|-----:|
| Pixtral-Large-Instruct-2411 | 1230 | 1194 | 1280 | 1242 | 1224 | 1250 | 1245 | 1221 | 1175 | 1266 | 715 | (-8, 8) | 65.97 | 1229 |
| claude-3-5-sonnet-20241022 | 1228 | 1252 | 1259 | 1211 | 1213 | 1272 | 1236 | 1192 | 1197 | 1251 | 405 | (-7, 8) | 65.84 | 1228 |
| gemini-1.5-pro-002 | 1151 | 1145 | 1105 | 1100 | 1110 | 1067 | 1107 | 1095 | 1134 | 1147 | 500 | (-8, 10) | 50.58 | 1118 |
| gpt-4o-2024-05-13 | 1114 | 1114 | 1114 | 1114 | 1114 | 1114 | 1114 | 1114 | 1114 | 1114 | 491 | (0, 0) | 50.00 | 1114 |
| gpt-4o-mini-2024-07-18 | 1049 | 1074 | 1165 | 1094 | 1096 | 1101 | 1130 | 1102 | 1037 | 1159 | 526 | (-8, 10) | 47.12 | 1094 |
| gpt-4o-2024-08-06 | 1096 | 1112 | 1050 | 1097 | 995 | 1080 | 1032 | 1058 | 1175 | 1015 | 374 | (-7, 7) | 44.98 | 1079 |
| gemini-1.5-flash-002 | 1025 | 877 | 1092 | 1007 | 1022 | 1011 | 993 | 946 | 1035 | 1087 | 493 | (-8, 9) | 35.33 | 1009 |
| InternVL2_5-78B | 1083 | 1018 | 1051 | 1091 | 1031 | 1084 | 1042 | 1073 | 1065 | 1023 | 558 | (-7, 10) | 42.85 | 1064 |
| Pixtral-12B-2409 | 1028 | 965 | 1099 | 1031 | 1024 | 1057 | 1047 | 1083 | 996 | 1063 | 659 | (-5, 8) | 39.1 | 1037 |
| Aria-Chat | 990 | 982 | 985 | 937 | 998 | 1034 | 1019 | 974 | 973 | 1016 | 675 | (-7, 8) | 32.88 | 990 |
| InternVL2_5-38B | 1000 | 979 | 1028 | 987 | 1021 | 904 | 932 | 1041 | 1026 | 933 | 521 | (-9, 9) | 32.5 | 987 |
| Qwen2-VL-72B-Instruct | 1009 | 914 | 965 | 991 | 986 | 960 | 962 | 921 | 998 | 970 | 557 | (-9, 9) | 31.37 | 978 |
| InternVL2_5-26B | 890 | 816 | 1008 | 894 | 944 | 876 | 864 | 964 | 880 | 896 | 490 | (-10, 8) | 22.59 | 900 |
| InternVL2_5-8B | 824 | 806 | 983 | 880 | 914 | 840 | 915 | 895 | 835 | 868 | 644 | (-11, 8) | 20.45 | 878 |
| Molmo-72B-0924 | 828 | 733 | 953 | 859 | 903 | 881 | 862 | 817 | 871 | 852 | 301 | (-12, 8) | 18.46 | 856 |
| NVLM-D-72B | 780 | 877 | 991 | 810 | 849 | 835 | 767 | 881 | 838 | 725 | 561 | (-10, 10) | 16.63 | 834 |
| Qwen2-VL-7B-Instruct | 803 | 689 | 827 | 877 | 861 | 816 | 736 | 680 | 858 | 833 | 787 | (-9, 10) | 15.40 | 818 |
| Llama-3.2-90B-Vision-Instruct | 830 | 751 | 624 | 754 | 806 | 842 | 626 | 769 | 940 | 662 | 448 | (-11, 10) | 12.89 | 782 |
| llava-onevision-qwen2-72b-ov | 696 | 735 | 762 | 726 | 767 | 689 | 663 | 679 | 853 | 620 | 360 | (-11, 12) | 10.09 | 734 |
| Llama-3.2-11B-Vision-Instruct | 671 | 541 | 681 | 702 | 766 | 761 | 624 | 524 | 744 | 614 | 531 | (-13, 16) | 7.93 | 688 |
| MiniCPM-V-2_6 | 644 | 599 | 767 | 659 | 812 | 676 | 673 | 667 | 656 | 681 | 646 | (-12, 10) | 7.97 | 689 |
| llava-onevision-qwen2-7b-ov | 605 | 570 | 807 | 683 | 809 | 681 | 715 | 608 | 573 | 724 | 575 | (-13, 10) | 7.93 | 688 |
| Molmo-7B-D-0924 | 536 | 304 | 720 | 631 | 638 | 655 | 681 | 531 | 613 | 603 | 310 | (-14, 12) | 5.41 | 617 |
| Molmo-7B-O-0924 | 457 | 134 | 623 | 483 | 681 | 599 | 606 | 380 | 428 | 528 | 296 | (-18, 19) | 3.54 | 540 |

### Multi-linguistic

| Model | PT | FR | ES | DE | Other | #Token | 95% CI | WR | Elo |
|-------------------------------|------|------|------|------|-------|--------|-----------|-------|------|
| claude-3-5-sonnet-20241022 | 1248 | 1319 | 1335 | 1389 | 1309 | 485 | (-21, 29) | 74.58 | 1301 |
| Pixtral-Large-Instruct-2411 | 1229 | 1496 | 1216 | 1324 | 1286 | 966 | (-23, 22) | 73.81 | 1294 |
| gemini-1.5-pro-002 | 1273 | 1168 | 1131 | 1168 | 1139 | 629 | (-20, 20) | 59.11 | 1178 |
| gpt-4o-2024-08-06 | 1159 | 1224 | 1226 | 1259 | 1114 | 480 | (-17, 26) | 60.35 | 1187 |
| gpt-4o-2024-05-13 | 1114 | 1114 | 1114 | 1114 | 1114 | 585 | (0, 0) | 50.0 | 1114 |
| gpt-4o-mini-2024-07-18 | 1038 | 1079 | 1071 | 1151 | 1099 | 657 | (-21, 16) | 45.84 | 1085 |
| Qwen2-VL-72B-Instruct | 1067 | 1199 | 944 | 1241 | 999 | 834 | (-18, 21) | 47.56 | 1097 |
| InternVL2_5-38B | 1038 | 1092 | 1070 | 1100 | 1044 | 868 | (-20, 18) | 43.98 | 1072 |
| InternVL2_5-78B | 948 | 1125 | 1035 | 1123 | 1084 | 841 | (-14, 20) | 42.71 | 1063 |
| Pixtral-12B-2409 | 935 | 1096 | 998 | 1077 | 929 | 1199 | (-14, 22) | 35.73 | 1012 |
| Aria-Chat | 964 | 1042 | 983 | 1041 | 999 | 1014 | (-23, 17) | 35.33 | 1009 |
| gemini-1.5-flash-002 | 1031 | 990 | 845 | 1015 | 815 | 567 | (-25, 19) | 28.47 | 954 |
| NVLM-D-72B | 900 | 863 | 850 | 898 | 918 | 907 | (-17, 25) | 21.99 | 894 |
| Llama-3.2-90B-Vision-Instruct | 905 | 860 | 824 | 863 | 864 | 968 | (-29, 21) | 20.92 | 883 |
| Molmo-72B-0924 | 834 | 835 | 852 | 853 | 878 | 426 | (-27, 19) | 18.9 | 861 |
| InternVL2_5-26B | 779 | 858 | 782 | 880 | 839 | 814 | (-28, 19) | 17.7 | 847 |
| Qwen2-VL-7B-Instruct | 701 | 875 | 673 | 865 | 678 | 1216 | (-24, 22) | 12.25 | 772 |
| llava-onevision-qwen2-72b-ov | 782 | 810 | 609 | 800 | 729 | 534 | (-27, 24) | 11.95 | 767 |
| InternVL2_5-8B | 760 | 776 | 765 | 821 | 602 | 1021 | (-22, 20) | 11.95 | 767 |
| Llama-3.2-11B-Vision-Instruct | 714 | 663 | 626 | 627 | 665 | 2027 | (-29, 21) | 8.4 | 699 |
| MiniCPM-V-2_6 | 522 | 559 | 603 | 634 | 455 | 890 | (-36, 35) | 4.44 | 581 |
| Molmo-7B-D-0924 | 445 | 495 | 577 | 613 | 505 | 406 | (-52, 33) | 4.32 | 576 |
| llava-onevision-qwen2-7b-ov | 579 | 386 | 144 | 403 | 588 | 686 | (-68, 37) | 3.07 | 514 |
| Molmo-7B-O-0924 | 383 | 256 | 536 | 246 | 429 | 512 | (-73, 51) | 1.95 | 433 |

### Multi-round

| Model | 2 | 3 | 4 | 5 | 6+ | #Token | 95% CI | WR | Elo |
|-------------------------------|------|------|------|------|------|--------|-----------|-------|------|
| claude-3-5-sonnet-20241022 | 1260 | 1249 | 1356 | 1248 | 1321 | 1477 | (-20, 18) | 70.82 | 1268 |
| Pixtral-Large-Instruct-2411 | 1233 | 1273 | 1304 | 1376 | 1253 | 2593 | (-23, 19) | 69.73 | 1259 |
| gpt-4o-mini-2024-07-18 | 1147 | 1143 | 1142 | 1200 | 1151 | 1749 | (-17, 24) | 55.16 | 1150 |
| gemini-1.5-pro-002 | 1136 | 1140 | 1107 | 1207 | 1145 | 1425 | (-26, 19) | 53.88 | 1141 |
| gpt-4o-2024-05-13 | 1114 | 1114 | 1114 | 1114 | 1114 | 1563 | (0, 0) | 50.0 | 1114 |
| gpt-4o-2024-08-06 | 1146 | 1050 | 1138 | 1023 | 965 | 1052 | (-22, 18) | 45.41 | 1082 |
| InternVL2_5-78B | 1135 | 1040 | 1148 | 1015 | 992 | 2015 | (-21, 20) | 44.84 | 1078 |
| Pixtral-12B-2409 | 1054 | 1008 | 1160 | 1013 | 1035 | 2264 | (-19, 20) | 40.48 | 1047 |
| gemini-1.5-flash-002 | 1015 | 1040 | 1015 | 1119 | 1006 | 1388 | (-16, 19) | 38.14 | 1030 |
| InternVL2_5-38B | 1003 | 1037 | 1036 | 913 | 902 | 1734 | (-18, 21) | 34.68 | 1004 |
| Qwen2-VL-72B-Instruct | 1023 | 972 | 1033 | 936 | 875 | 1608 | (-21, 19) | 32.24 | 985 |
| Aria-Chat | 937 | 913 | 946 | 887 | 812 | 2321 | (-27, 12) | 23.92 | 913 |
| Molmo-72B-0924 | 886 | 817 | 787 | 920 | 808 | 967 | (-28, 25) | 18.64 | 858 |
| InternVL2_5-26B | 881 | 811 | 805 | 753 | 638 | 1554 | (-27, 28) | 15.77 | 823 |
| InternVL2_5-8B | 814 | 724 | 775 | 686 | 559 | 1835 | (-25, 22) | 11.77 | 764 |
| llava-onevision-qwen2-72b-ov | 753 | 721 | 673 | 525 | 692 | 1176 | (-31, 26) | 10.3 | 738 |
| Llama-3.2-90B-Vision-Instruct | 754 | 757 | 784 | 426 | 605 | 1350 | (-36, 24) | 9.88 | 730 |
| Qwen2-VL-7B-Instruct | 808 | 622 | 637 | 557 | 495 | 2004 | (-34, 25) | 9.48 | 722 |
| NVLM-D-72B | 770 | 557 | 602 | 641 | 682 | 1371 | (-35, 33) | 8.49 | 701 |
| llava-onevision-qwen2-7b-ov | 737 | 591 | 649 | N/A | 512 | 1743 | (-30, 30) | 6.58 | 653 |
| Llama-3.2-11B-Vision-Instruct | 741 | 380 | 487 | 275 | 490 | 2094 | (-38, 32) | 6.03 | 637 |
| MiniCPM-V-2_6 | 664 | 575 | 628 | 530 | 389 | 1861 | (-33, 37) | 5.35 | 615 |
| Molmo-7B-D-0924 | 672 | 470 | 523 | 409 | 618 | 923 | (-34, 26) | 5.04 | 604 |
| Molmo-7B-O-0924 | 589 | 413 | 490 | N/A | 402 | 925 | (-49, 37) | 3.43 | 534 |

## Citation

```
@article{?,
  title={?},
  author={?},
  journal={?},
  year={?}
}
```
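
A note on reading the leaderboard columns: the card does not say how WR is derived, but the reported values appear consistent with the standard Elo expected-score formula evaluated against `gpt-4o-2024-05-13`, which is pinned at 1114 in every table. The sketch below checks that reading against a few rows copied from the single-round table. The names `BASELINE_ELO` and `expected_win_rate` are illustrative and not part of the benchmark's code, and the formula is an assumption inferred from the numbers rather than a documented method.

```python
# Assumption: WR is the Elo expected score (in percent) against the
# baseline model, gpt-4o-2024-05-13, whose Elo is fixed at 1114.
BASELINE_ELO = 1114


def expected_win_rate(elo: float, baseline: float = BASELINE_ELO) -> float:
    """Expected win rate in percent under the standard Elo logistic model."""
    return 100.0 / (1.0 + 10.0 ** ((baseline - elo) / 400.0))


# (Elo, reported WR) pairs copied from the single-round table.
rows = {
    "Pixtral-Large-Instruct-2411": (1229, 65.97),
    "gemini-1.5-pro-002": (1118, 50.58),
    "gpt-4o-2024-05-13": (1114, 50.00),
    "InternVL2_5-78B": (1064, 42.85),
    "Molmo-7B-O-0924": (540, 3.54),
}

for model, (elo, reported_wr) in rows.items():
    computed = expected_win_rate(elo)
    print(f"{model:30s} Elo={elo:4d} reported={reported_wr:5.2f} computed={computed:5.2f}")
```

If this reading holds, WR carries no information beyond Elo: it is the same ranking expressed as a probability of beating the baseline, so a 400-point Elo deficit corresponds to a win rate of roughly 9%.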
Provider:
maas
Created:
2025-03-19