GenExam

Name: GenExam
Creator: maas
Published: 2025-12-04 16:50:13
License: No description available

ModelScope Community · Updated 2025-12-04 · Indexed 2025-09-20

Download link:

https://modelscope.cn/datasets/OpenGVLab/GenExam

Resource description:

<div align="center">

<h1 align="center">GenExam: A Multidisciplinary Text-to-Image Exam</h1>

[Zhaokai Wang](https://www.wzk.plus/)\*, [Penghao Yin](https://penghaoyin.github.io/)\*, [Xiangyu Zhao](https://scholar.google.com/citations?user=eqFr7IgAAAAJ), [Changyao Tian](https://scholar.google.com/citations?user=kQ3AisQAAAAJ), [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ), [Wenhai Wang](https://whai362.github.io/), [Jifeng Dai](https://jifengdai.org/), [Gen Luo](https://scholar.google.com/citations?user=EyZqU9gAAAAJ)

<p align="center">
<a href='https://huggingface.co/papers/2509.14232'> <img src='https://img.shields.io/badge/Paper-2509.14232-brown?style=flat&logo=arXiv' alt='arXiv PDF'> </a>
<a href='https://github.com/OpenGVLab/GenExam'> <img src='https://img.shields.io/badge/Github-black?style=flat&logo=github' alt='data img/data'> </a>
<a href='#leaderboard'> <img src='https://img.shields.io/badge/Rank-Leaderboard-blue?style=flat&logo=flipboard' alt='data img/data'> </a>

For guidelines on evaluation, please refer to our [repo](https://github.com/OpenGVLab/GenExam).
</p>

</div>

<div align="center"> <img src="assets/teaser.png" alt="teaser" width="100%"> </div>

## ⭐️ Introduction

Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for **multidisciplinary text-to-image exams**, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility.
Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image score below 15% on the strict metric, and most models score almost 0%, which underscores how challenging the benchmark is. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate knowledge, reasoning, and generation, providing insights into the path toward AGI.

<div align="center"> <img src="assets/overview.png" alt="overview" width="100%"> </div>

<a id="radar"></a>

## 🚀 Leaderboard

### Strict Score

<table>
<tr>
<th style="width:25%">Model</th>
<th>Math</th><th>Phy</th><th>Chem</th><th>Bio</th>
<th>Geo</th><th>Comp</th><th>Eng</th><th>Econ</th>
<th>Music</th><th>Hist</th><th>Overall</th>
</tr>
<tr>
<th colspan="12" style="text-align:left">Closed-source Models</th>
</tr>
<tr>
<td>GPT-Image-1</td><td>8.0</td><td>13.2</td><td>13.5</td><td>22.8</td><td>15.9</td><td>10.3</td><td>13.1</td><td>13.0</td><td>9.3</td><td>2.4</td><td>12.1</td>
</tr>
<tr>
<td>Seedream 4.0</td><td>2.6</td><td>3.5</td><td>5.9</td><td>18.6</td><td>10.6</td><td>6.9</td><td>11.7</td><td>5.2</td><td>0.0</td><td>7.3</td><td>7.2</td>
</tr>
<tr>
<td>Imagen-4-Ultra</td><td>2.6</td><td>9.7</td><td>9.3</td><td>14.7</td><td>7.6</td><td>2.9</td><td>12.6</td><td>9.1</td><td>0.0</td><td>0.0</td><td>6.9</td>
</tr>
<tr>
<td>Gemini-2.5-Flash-Image</td><td>0.7</td><td>7.1</td><td>4.2</td><td>5.1</td><td>4.5</td><td>4.9</td><td>10.0</td><td>1.3</td><td>1.5</td><td>0.0</td><td>3.9</td>
</tr>
<tr>
<td>Seedream 3.0</td><td>0.7</td><td>0.0</td><td>0.8</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.2</td>
</tr>
<tr>
<td>FLUX.1 Kontext max</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<th colspan="12" style="text-align:left">Open-source T2I Models</th>
</tr>
<tr>
<td>Qwen-Image</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>3.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.3</td>
</tr>
<tr>
<td>HiDream-I1-Full</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>HunyuanImage-3.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>FLUX.1 dev</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>FLUX.1 Krea</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>Stable Diffusion 3.5 Large</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<th colspan="12" style="text-align:left">Open-source Unified MLLMs</th>
</tr>
<tr>
<td>BAGEL (thinking)</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>BAGEL</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>Show-o2-7B</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>Show-o2-1.5B-HQ</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>BLIP3o-NEXT-GRPO-Text-3</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>BLIP3o-8B</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>Janus-Pro</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
<tr>
<td>Emu3</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td>
</tr>
</table>

<br>

### Relaxed Score

<table>
<tr>
<th style="width:25%">Model</th>
<th>Math</th><th>Phy</th><th>Chem</th><th>Bio</th>
<th>Geo</th><th>Comp</th><th>Eng</th><th>Econ</th>
<th>Music</th><th>Hist</th><th>Overall</th>
</tr>
<tr>
<th colspan="12" style="text-align:left">Closed-source Models</th>
</tr>
<tr>
<td>GPT-Image-1</td><td>52.0</td><td>66.4</td><td>53.4</td><td>74.6</td><td>73.9</td><td>55.6</td><td>65.5</td><td>65.8</td><td>52.6</td><td>67.4</td><td>62.6</td>
</tr>
<tr>
<td>Seedream 4.0</td><td>39.8</td><td>49.0</td><td>46.1</td><td>71.0</td><td>65.1</td><td>52.2</td><td>60.0</td><td>56.0</td><td>34.5</td><td>56.7</td><td>53.0</td>
</tr>
<tr>
<td>Imagen-4-Ultra</td><td>35.9</td><td>57.4</td><td>44.5</td><td>68.1</td><td>66.9</td><td>40.1</td><td>65.6</td><td>59.7</td><td>38.4</td><td>57.8</td><td>53.4</td>
</tr>
<tr>
<td>Gemini-2.5-Flash-Image</td><td>43.1</td><td>60.9</td><td>45.3</td><td>72.6</td><td>70.2</td><td>47.4</td><td>65.8</td><td>59.8</td><td>37.0</td><td>57.1</td><td>55.9</td>
</tr>
<tr>
<td>Seedream 3.0</td><td>18.6</td><td>21.5</td><td>18.3</td><td>32.2</td><td>38.2</td><td>15.3</td><td>26.5</td><td>12.5</td><td>21.6</td><td>29.2</td><td>23.4</td>
</tr>
<tr>
<td>FLUX.1 Kontext max</td><td>23.5</td><td>25.6</td><td>19.2</td><td>38.3</td><td>47.5</td><td>20.9</td><td>28.9</td><td>22.3</td><td>25.4</td><td>33.5</td><td>28.5</td>
</tr>
<tr>
<th colspan="12" style="text-align:left">Open-source T2I Models</th>
</tr>
<tr>
<td>Qwen-Image</td><td>18.9</td><td>26.3</td><td>15.3</td><td>32.1</td><td>49.6</td><td>18.9</td><td>32.0</td><td>20.3</td><td>23.4</td><td>38.6</td><td>27.5</td>
</tr>
<tr>
<td>HiDream-I1-Full</td><td>16.7</td><td>17.7</td><td>13.5</td><td>27.3</td><td>36.2</td><td>15.4</td><td>24.4</td><td>18.8</td><td>21.3</td><td>31.8</td><td>22.3</td>
</tr>
<tr>
<td>HunyuanImage-3.0</td><td>17.0</td><td>17.2</td><td>18.8</td><td>18.7</td><td>30.4</td><td>15.5</td><td>16.9</td><td>11.7</td><td>23.9</td><td>20.4</td><td>19.1</td>
</tr>
<tr>
<td>FLUX.1 dev</td><td>12.2</td><td>14.4</td><td>12.5</td><td>22.8</td><td>36.4</td><td>11.0</td><td>14.0</td><td>9.2</td><td>21.3</td><td>21.7</td><td>17.6</td>
</tr>
<tr>
<td>FLUX.1 Krea</td><td>7.0</td><td>14.0</td><td>8.5</td><td>26.5</td><td>38.4</td><td>8.4</td><td>15.4</td><td>11.1</td><td>16.8</td><td>17.4</td><td>16.4</td>
</tr>
<tr>
<td>Stable Diffusion 3.5 Large</td><td>12.2</td><td>13.2</td><td>10.7</td><td>21.8</td><td>38.8</td><td>6.6</td><td>16.3</td><td>8.0</td><td>24.1</td><td>18.0</td><td>17.0</td>
</tr>
<tr>
<th colspan="12" style="text-align:left">Open-source Unified MLLMs</th>
</tr>
<tr>
<td>BAGEL (thinking)</td><td>11.7</td><td>13.8</td><td>11.9</td><td>15.2</td><td>28.5</td><td>6.2</td><td>10.7</td><td>6.3</td><td>14.7</td><td>16.0</td><td>13.5</td>
</tr>
<tr>
<td>BAGEL</td><td>14.7</td><td>10.6</td><td>7.9</td><td>10.8</td><td>24.5</td><td>6.8</td><td>10.2</td><td>5.3</td><td>13.7</td><td>14.4</td><td>11.9</td>
</tr>
<tr>
<td>Show-o2-7B</td><td>10.8</td><td>11.9</td><td>4.8</td><td>12.8</td><td>33.3</td><td>4.7</td><td>11.8</td><td>7.0</td><td>8.8</td><td>14.5</td><td>12.0</td>
</tr>
<tr>
<td>Show-o2-1.5B-HQ</td><td>7.3</td><td>7.5</td><td>6.2</td><td>15.0</td><td>25.3</td><td>4.3</td><td>9.3</td><td>7.3</td><td>7.6</td><td>19.8</td><td>11.0</td>
</tr>
<tr>
<td>BLIP3o-NEXT-GRPO-Text-3</td><td>15.5</td><td>10.5</td><td>9.2</td><td>15.5</td><td>23.7</td><td>8.2</td><td>10.1</td><td>8.1</td><td>15.2</td><td>10.2</td><td>12.6</td>
</tr>
<tr>
<td>BLIP3o-8B</td><td>6.4</td><td>5.5</td><td>4.7</td><td>7.0</td><td>16.7</td><td>3.6</td><td>8.4</td><td>2.5</td><td>6.0</td><td>11.2</td><td>7.2</td>
</tr>
<tr>
<td>Janus-Pro</td><td>13.7</td><td>8.8</td><td>8.2</td><td>7.2</td><td>18.8</td><td>3.9</td><td>10.5</td><td>4.2</td><td>14.5</td><td>6.6</td><td>9.6</td>
</tr>
<tr>
<td>Emu3</td><td>11.3</td><td>0.6</td><td>0.6</td><td>5.6</td><td>34.6</td><td>5.1</td><td>16.5</td><td>1.9</td><td>5.8</td><td>6.2</td><td>8.8</td>
</tr>
</table>

### Comparison Across Four Dimensions

<div align="center"> <img src="assets/model_performance_comparison.png" width="100%"> </div>

## 🖼 Examples of Generated Images

For more examples, please refer to the appendix in our paper.

<div align="center">
<img src="assets/math.png" alt="math" width="100%">
<img src="assets/music.png" alt="music" width="100%">
</div>

## 🛠️ Evaluation Guidelines

Please refer to our [repo](https://github.com/OpenGVLab/GenExam).

## 🖊️ Citation

If you find our work helpful, please consider giving us a ⭐ and citing our paper:

```bibtex
@article{GenExam,
  title={GenExam: A Multidisciplinary Text-to-Image Exam},
  author={Wang, Zhaokai and Yin, Penghao and Zhao, Xiangyu and Tian, Changyao and Qiao, Yu and Wang, Wenhai and Dai, Jifeng and Luo, Gen},
  journal={arXiv preprint arXiv:2509.14232},
  year={2025}
}
```

Provider:

maas

Created:

2025-09-18
