GenAI-Bench-1600

Name: GenAI-Bench-1600
Creator: maas
Published: 2026-05-01 13:19:01
License: no description provided

ModelScope Community, updated 2026-05-01, indexed 2025-04-12

Download link:

https://modelscope.cn/datasets/AI-ModelScope/GenAI-Bench-1600

Resource overview:

# ***GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation***

---

<div align="center">
Baiqi Li<sup>1*</sup>, Zhiqiu Lin<sup>1,2*</sup>, Deepak Pathak<sup>1</sup>, Jiayao Li<sup>1</sup>, Yixin Fei<sup>1</sup>, Kewen Wu<sup>1</sup>, Tiffany Ling<sup>1</sup>, Xide Xia<sup>2†</sup>, Pengchuan Zhang<sup>2†</sup>, Graham Neubig<sup>1†</sup>, and Deva Ramanan<sup>1†</sup>.
</div>

<div align="center" style="font-weight:bold;">
<sup>1</sup>Carnegie Mellon University, <sup>2</sup>Meta
</div>

## Links:

<div align="center">

[**📖Paper**](https://arxiv.org/pdf/2406.13743) | [🏠**Home Page**](https://linzhiqiu.github.io/papers/genai_bench) | [🔍**GenAI-Bench Dataset Viewer**](https://huggingface.co/spaces/BaiqiL/GenAI-Bench-DataViewer) | [**🏆Leaderboard**](#Leaderboard)

</div>

<div align="center">

[🗂️GenAI-Bench-1600(ZIP format)](https://huggingface.co/datasets/BaiqiL/GenAI-Bench-1600) | [🗂️GenAI-Bench-Video(ZIP format)](https://huggingface.co/datasets/zhiqiulin/GenAI-Bench-800) | [🗂️GenAI-Bench-Ranking(ZIP format)](https://huggingface.co/datasets/zhiqiulin/GenAI-Image-Ranking-800)

</div>

## 🚩 **News**

- ✅ Aug. 18, 2024. 💥 GenAI-Bench-1600 is used by 🧨 [**Imagen 3**](https://arxiv.org/abs/2408.07009)!
- ✅ Jun. 19, 2024. 💥 Our [paper](https://openreview.net/pdf?id=hJm7qnW3ym) won the **Best Paper** award at the **CVPR SynData4CV workshop**!

## Usage

```python
# Load the GenAI-Bench (GenAI-Bench-1600) benchmark
from datasets import load_dataset

dataset = load_dataset("BaiqiL/GenAI-Bench")
```

## Citation Information

```
@article{li2024genai,
  title={Genai-bench: Evaluating and improving compositional text-to-visual generation},
  author={Li, Baiqi and Lin, Zhiqiu and Pathak, Deepak and Li, Jiayao and Fei, Yixin and Wu, Kewen and Ling, Tiffany and Xia, Xide and Zhang, Pengchuan and Neubig, Graham and others},
  journal={arXiv preprint arXiv:2406.13743},
  year={2024}
}
```

![](https://huggingface.co/datasets/BaiqiL/GenAI-Bench-pictures/resolve/main/GenAI-Bench.jpg)

![](https://huggingface.co/datasets/BaiqiL/GenAI-Bench-pictures/resolve/main/genaibench_examples.jpg)

## Description:

Our dataset consists of three parts: **GenAI-Bench (GenAI-Bench-1600)**, **GenAI-Bench-Video**, and **GenAI-Bench-Ranking**, with GenAI-Bench-1600 being the primary dataset. For details on how the ZIP-format datasets above are processed, please refer to `dataset.py` in the [code](https://github.com/linzhiqiu/t2v_metrics).

[**GenAI-Bench benchmark (GenAI-Bench-1600)**](https://huggingface.co/datasets/BaiqiL/GenAI-Bench-1600) consists of 1,600 challenging real-world text prompts sourced from professional designers. Compared to benchmarks such as PartiPrompt and T2I-CompBench, GenAI-Bench captures a wider range of aspects of compositional text-to-visual generation, ranging from _basic_ (scene, attribute, relation) to _advanced_ (counting, comparison, differentiation, logic). The GenAI-Bench benchmark also collects human alignment ratings (1-to-5 Likert scales) on images and videos generated by ten leading models, such as Stable Diffusion, DALL-E 3, Midjourney v6, Pika v1, and Gen2.

GenAI-Bench:

- Prompt: 1,600 prompts sourced from professional designers.
- Compositional Skill Tags: Multiple compositional tags for each prompt. The compositional skill tags are categorized into **_Basic Skill_** and **_Advanced Skill_**.
  For detailed definitions and examples, please refer to [our paper]().
- Images: Generated images are collected from DALLE_3, DeepFloyd_I_XL_v1, Midjourney_6, SDXL_2_1, SDXL_Base and SDXL_Turbo.
- Human Ratings: 1-to-5 Likert scale ratings for each image.

**(Other Datasets: [GenAI-Bench-Video](https://huggingface.co/datasets/zhiqiulin/GenAI-Bench-800) | [GenAI-Bench-Ranking](https://huggingface.co/datasets/zhiqiulin/GenAI-Image-Ranking-800))**

### Languages

English

### Supported Tasks

Text-to-Visual Generation; Evaluation for Automated Evaluation Metrics.

### Comparing GenAI-Bench to Existing Text-to-Visual Benchmarks

![](https://huggingface.co/datasets/BaiqiL/GenAI-Bench-pictures/resolve/main/Comparison.png)

## Dataset Structure

### Data Instances

```
Dataset({
    features: ['Index', 'Prompt', 'Tags', 'HumanRatings', 'DALLE_3', 'DeepFloyd_I_XL_v1', 'Midjourney_6', 'SDXL_2_1', 'SDXL_Base', 'SDXL_Turbo'],
    num_rows: 1600
})
```

### Data Fields

Name | Explanation
--- | ---
`Index` | **Description:** the unique ID of an example. **Data type:** string
`Prompt` | **Description:** prompt. **Data type:** string
`Tags` | **Description:** compositional skill tags of the prompt. **Data type:** dict
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`basic_skills` | **Description:** basic skills in the prompt. **Data type:** list
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`advanced_skills` | **Description:** advanced skills in the prompt. **Data type:** list
`DALLE_3` | **Description:** generated image from DALLE3. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`Midjourney_6` | **Description:** generated image from Midjourney_6. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`DeepFloyd_I_XL_v1` | **Description:** generated image from DeepFloyd_I_XL_v1. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`SDXL_2_1` | **Description:** generated image from SDXL_2_1. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`SDXL_Base` | **Description:** generated image from SDXL_Base. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`SDXL_Turbo` | **Description:** generated image from SDXL_Turbo. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`HumanRatings` | **Description:** human ratings of how well each image matches the prompt. **Data type:** dict
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`DALLE_3` | **Description:** human ratings of how well the image matches the prompt. **Data type:** list
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`SDXL_Turbo` | **Description:** human ratings of how well the image matches the prompt. **Data type:** list
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`Midjourney_6` | **Description:** human ratings of how well the image matches the prompt. **Data type:** list
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`DeepFloyd_I_XL_v1` | **Description:** human ratings of how well the image matches the prompt. **Data type:** list
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`SDXL_2_1` | **Description:** human ratings of how well the image matches the prompt. **Data type:** list
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`SDXL_Base` | **Description:** human ratings of how well the image matches the prompt. **Data type:** list

### Statistics

Dataset | Number of Prompts | Number of Skill Tags | Number of Images | Number of Videos | Number of Human Ratings
--- | ---: | ---: | ---: | ---: | ---:
GenAI-Bench | 1,600 | 5,000+ | 9,600 | -- | 28,800
GenAI-Bench-Video | 800 | 2,500+ | -- | 3,200 | 9,600
GenAI-Ranking | 800 | 2,500+ | 14,400 | -- | 43,200

(Each prompt-image/video pair has three human ratings.)

## Data Source

### Prompts

All prompts are sourced from professional designers who use tools such as Midjourney and CIVITAI.

### Multiple Compositional Tags for Prompts

All tags on each prompt are verified by human annotators.

### Generated Images

Images are generated for all 1,600 GenAI-Bench prompts using DALLE_3, DeepFloyd_I_XL_v1, Midjourney_6, SDXL_2_1, SDXL_Base and SDXL_Turbo.

### Generated Videos

Videos are generated for the 800 GenAI-Bench-Video prompts using Pika, Gen2, ModelScope and Floor33.

### Human Ratings

We hired three trained human annotators to individually rate each generated image/video.
We pay the local minimum wage of 12 dollars per hour for a total of about 800 annotator hours.

## Dataset Construction

### Overall Process

![image/png](https://huggingface.co/datasets/BaiqiL/GenAI-Bench-pictures/resolve/main/Dataset%20Construction.jpg)

- **Prompt Collecting:** we source prompts from professional designers who use tools such as Midjourney and CIVITAI. This ensures the prompts encompass practical skills relevant to real-world applications and are free of subjective or inappropriate content.
- **Compositional Skills Tagging:** each GenAI-Bench prompt is carefully tagged with all its evaluated skills.
- **Image/Video Collecting and Human Rating:** we then generate images and videos using state-of-the-art models like SD-XL and Gen2. We follow the recommended annotation protocol to collect 1-to-5 Likert scale ratings for how well the generated visuals align with the input text prompts.

# Leaderboard

<img src="https://huggingface.co/datasets/BaiqiL/GenAI-Bench-pictures/resolve/main/vqascore_leaderboard.jpg" alt="leaderboard" width="500"/>

## Licensing Information

apache-2.0

## Maintenance

We will continuously update the GenAI-Bench benchmark. If you have any questions about the dataset or notice any issues, please feel free to contact [Baiqi Li](mailto:libaiqi123@gmail.com) or [Zhiqiu Lin](mailto:zhiqiul@andrew.cmu.edu). Our team is committed to maintaining this dataset in the long run to ensure its quality!
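
## Example: Summarizing One Record

Returning to the Data Fields section, the documented schema suggests a simple way to summarize a single record. The following is a minimal sketch, not the authors' evaluation code. It assumes a record behaves like a plain dict with the documented `Tags` and `HumanRatings` fields and that each rating list holds numeric 1-to-5 values. The helper names `mean_ratings` and `has_skill`, the sample record, and the tag strings in it are hypothetical; the tag strings actually stored in the dataset may differ.

```python
from statistics import mean

# Image-model field names as listed under "Data Instances".
MODEL_FIELDS = [
    "DALLE_3", "DeepFloyd_I_XL_v1", "Midjourney_6",
    "SDXL_2_1", "SDXL_Base", "SDXL_Turbo",
]


def mean_ratings(example):
    """Return the average human rating per model for one record.

    Models with a missing or empty rating list are skipped.
    """
    ratings = example["HumanRatings"]
    return {m: mean(ratings[m]) for m in MODEL_FIELDS if ratings.get(m)}


def has_skill(example, skill):
    """Check whether a skill appears among the basic or advanced tags."""
    tags = example["Tags"]
    return (skill in tags.get("basic_skills", [])
            or skill in tags.get("advanced_skills", []))


# Hypothetical record for illustration only; all values are made up and
# the image fields are omitted.
sample = {
    "Index": "00000",
    "Prompt": "three red apples on a wooden table",
    "Tags": {"basic_skills": ["scene", "attribute"],
             "advanced_skills": ["counting"]},
    "HumanRatings": {
        "DALLE_3": [5, 4, 5],
        "DeepFloyd_I_XL_v1": [3, 4, 3],
        "Midjourney_6": [4, 4, 5],
        "SDXL_2_1": [2, 3, 2],
        "SDXL_Base": [3, 3, 3],
        "SDXL_Turbo": [2, 2, 3],
    },
}

scores = mean_ratings(sample)
best_model = max(scores, key=scores.get)
print(best_model, round(scores[best_model], 2))
print(has_skill(sample, "counting"))
```

On real data, the same helpers could be applied to records of the object returned by the `load_dataset` call in the Usage section. Split names are not documented here, so it is worth inspecting the returned object before indexing into it.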
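
## Example: Checking the Statistics Table

The counts in the Statistics table follow from the note that each prompt-image/video pair has three human ratings. A small sketch of that arithmetic is given below. The variable names are illustrative, and every number is copied from the Statistics table and the Data Source section rather than computed from the data itself.

```python
RATINGS_PER_ITEM = 3  # "each prompt-image/video pair has three human ratings"

# GenAI-Bench: 1,600 prompts, one image per prompt from each of six models.
image_models = ["DALLE_3", "DeepFloyd_I_XL_v1", "Midjourney_6",
                "SDXL_2_1", "SDXL_Base", "SDXL_Turbo"]
n_images = 1600 * len(image_models)             # 9,600 images
n_image_ratings = n_images * RATINGS_PER_ITEM   # 28,800 ratings

# GenAI-Bench-Video: 800 prompts, one video per prompt from each of four models.
video_models = ["Pika", "Gen2", "ModelScope", "Floor33"]
n_videos = 800 * len(video_models)              # 3,200 videos
n_video_ratings = n_videos * RATINGS_PER_ITEM   # 9,600 ratings

# GenAI-Ranking: the table lists 14,400 images.
n_ranking_ratings = 14400 * RATINGS_PER_ITEM    # 43,200 ratings

print(n_images, n_image_ratings)
print(n_videos, n_video_ratings)
print(n_ranking_ratings)
```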


Provider:

maas

Created:

2025-04-11
