GenAI-Bench-1600
ModelScope Community | Updated 2026-05-01 | Indexed 2025-04-12
Download link:
https://modelscope.cn/datasets/AI-ModelScope/GenAI-Bench-1600
Resource summary:
# ***GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation***
---
<div align="center">
Baiqi Li<sup>1*</sup>, Zhiqiu Lin<sup>1,2*</sup>, Deepak Pathak<sup>1</sup>, Jiayao Li<sup>1</sup>, Yixin Fei<sup>1</sup>, Kewen Wu<sup>1</sup>, Tiffany Ling<sup>1</sup>, Xide Xia<sup>2†</sup>, Pengchuan Zhang<sup>2†</sup>, Graham Neubig<sup>1†</sup>, and Deva Ramanan<sup>1†</sup>.
</div>
<div align="center" style="font-weight:bold;">
<sup>1</sup>Carnegie Mellon University, <sup>2</sup>Meta
</div>
## Links:
<div align="center">
[**📖Paper**](https://arxiv.org/pdf/2406.13743) | [🏠**Home Page**](https://linzhiqiu.github.io/papers/genai_bench) | [🔍**GenAI-Bench Dataset Viewer**](https://huggingface.co/spaces/BaiqiL/GenAI-Bench-DataViewer) | [**🏆Leaderboard**](#Leaderboard)
</div>
<div align="center">
[🗂️GenAI-Bench-1600 (ZIP format)](https://huggingface.co/datasets/BaiqiL/GenAI-Bench-1600) | [🗂️GenAI-Bench-Video (ZIP format)](https://huggingface.co/datasets/zhiqiulin/GenAI-Bench-800) | [🗂️GenAI-Bench-Ranking (ZIP format)](https://huggingface.co/datasets/zhiqiulin/GenAI-Image-Ranking-800)
</div>
## 🚩 **News**
- ✅ Aug. 18, 2024. 💥 GenAI-Bench-1600 is used by 🧨 [**Imagen 3**](https://arxiv.org/abs/2408.07009)!
- ✅ Jun. 19, 2024. 💥 Our [paper](https://openreview.net/pdf?id=hJm7qnW3ym) won the **Best Paper** award at the **CVPR SynData4CV workshop**!
## Usage
```python
# load the GenAI-Bench(GenAI-Bench-1600) benchmark
from datasets import load_dataset
dataset = load_dataset("BaiqiL/GenAI-Bench")
```
## Citation Information
```bibtex
@article{li2024genai,
title={{GenAI-Bench}: Evaluating and Improving Compositional Text-to-Visual Generation},
author={Li, Baiqi and Lin, Zhiqiu and Pathak, Deepak and Li, Jiayao and Fei, Yixin and Wu, Kewen and Ling, Tiffany and Xia, Xide and Zhang, Pengchuan and Neubig, Graham and Ramanan, Deva},
journal={arXiv preprint arXiv:2406.13743},
year={2024}
}
```


## Description:
Our dataset consists of three parts: **GenAI-Bench (GenAI-Bench-1600)**, **GenAI-Bench-Video**, and **GenAI-Bench-Ranking**, with GenAI-Bench-1600 being the primary dataset. For details on processing the ZIP-format versions of these datasets, see `dataset.py` in the [t2v_metrics repository](https://github.com/linzhiqiu/t2v_metrics).
[**GenAI-Bench benchmark (GenAI-Bench-1600)**](https://huggingface.co/datasets/BaiqiL/GenAI-Bench-1600) consists of 1,600 challenging real-world text prompts sourced from professional designers. Compared to benchmarks such as PartiPrompts and T2I-CompBench, GenAI-Bench covers a wider range of compositional text-to-visual generation skills, from _basic_ (scene, attribute, relation) to _advanced_ (counting, comparison, differentiation, logic). It also provides human alignment ratings on a 1-to-5 Likert scale for images and videos generated by ten leading models, including Stable Diffusion, DALL-E 3, Midjourney v6, Pika v1, and Gen2.
GenAI-Bench:
- Prompts: 1,600 prompts sourced from professional designers.
- Compositional Skill Tags: each prompt carries multiple compositional tags, grouped into **_Basic Skill_** and **_Advanced Skill_** categories. For detailed definitions and examples, please refer to [our paper](https://arxiv.org/pdf/2406.13743).
- Images: Generated images are collected from DALLE_3, DeepFloyd_I_XL_v1, Midjourney_6, SDXL_2_1, SDXL_Base and SDXL_Turbo.
- Human Ratings: 1-to-5 Likert scale ratings for each image.
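As a rough illustration of how the per-prompt skill tags might be used, the sketch below selects the prompts that exercise a given skill. It assumes each record carries a `Tags` dict with `basic_skills` and `advanced_skills` lists, as described under Data Fields; the helper name `prompts_with_skill`, the tag strings (e.g. `"counting"`), and the example records are hypothetical placeholders, not values taken from the dataset.

```python
from typing import Iterable


def prompts_with_skill(records: Iterable[dict], skill: str) -> list[str]:
    """Return the prompts whose basic or advanced skill tags include `skill`.

    Assumes each record looks like {"Prompt": str, "Tags": {"basic_skills": [...],
    "advanced_skills": [...]}}; the exact tag strings in GenAI-Bench may differ.
    """
    selected = []
    for record in records:
        tags = record.get("Tags", {})
        all_skills = tags.get("basic_skills", []) + tags.get("advanced_skills", [])
        if skill in all_skills:
            selected.append(record["Prompt"])
    return selected


# Hypothetical records, for illustration only.
examples = [
    {"Prompt": "three red apples on a wooden table",
     "Tags": {"basic_skills": ["attribute", "scene"], "advanced_skills": ["counting"]}},
    {"Prompt": "a cat sitting to the left of a dog",
     "Tags": {"basic_skills": ["relation"], "advanced_skills": []}},
]

print(prompts_with_skill(examples, "counting"))  # ['three red apples on a wooden table']
```

Merging both tag lists keeps the helper agnostic to whether a skill is classed as basic or advanced.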
**(Other Datasets: [GenAI-Bench-Video](https://huggingface.co/datasets/zhiqiulin/GenAI-Bench-800) | [GenAI-Bench-Ranking](https://huggingface.co/datasets/zhiqiulin/GenAI-Image-Ranking-800))**
### Languages
English
### Supported Tasks
Text-to-visual generation; evaluation of automated text-to-visual evaluation metrics.
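For the second task, one common way to assess an automated metric is to correlate its per-image scores with the averaged human ratings. The sketch below computes a Pearson correlation in pure Python; the choice of Pearson (rather than, say, a rank correlation such as Kendall's tau) and the toy numbers are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
import math
from statistics import mean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / (sx * sy)


# Toy data: hypothetical metric scores and mean human ratings (1-to-5 scale).
metric_scores = [0.91, 0.35, 0.62, 0.12]
human_means = [4.7, 2.3, 3.7, 1.3]
print(round(pearson(metric_scores, human_means), 3))
```

A higher correlation indicates that the metric ranks prompt-image pairs more like human annotators do.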
### Comparing GenAI-Bench to Existing Text-to-Visual Benchmarks

## Dataset Structure
### Data Instances
```
Dataset({
features: ['Index', 'Prompt', 'Tags', 'HumanRatings', 'DALLE_3', 'DeepFloyd_I_XL_v1', 'Midjourney_6', 'SDXL_2_1', 'SDXL_Base', 'SDXL_Turbo'],
num_rows: 1600
})
```
### Data Fields
Name | Explanation
--- | ---
`Index` | **Description:** the unique ID of an example. **Data type:** string
`Prompt` | **Description:** the text prompt. **Data type:** string
`Tags` | **Description:** compositional skill tags of the prompt, with keys `basic_skills` and `advanced_skills`. **Data type:** dict
`Tags["basic_skills"]` | **Description:** basic skills in the prompt. **Data type:** list
`Tags["advanced_skills"]` | **Description:** advanced skills in the prompt. **Data type:** list
`DALLE_3` | **Description:** image generated by DALLE_3. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`Midjourney_6` | **Description:** image generated by Midjourney_6. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`DeepFloyd_I_XL_v1` | **Description:** image generated by DeepFloyd_I_XL_v1. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`SDXL_2_1` | **Description:** image generated by SDXL_2_1. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`SDXL_Base` | **Description:** image generated by SDXL_Base. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`SDXL_Turbo` | **Description:** image generated by SDXL_Turbo. **Data type:** PIL.JpegImagePlugin.JpegImageFile
`HumanRatings` | **Description:** human ratings of how well each model's image matches the prompt, keyed by model name. **Data type:** dict
`HumanRatings["DALLE_3"]` | **Description:** human ratings of how well the DALLE_3 image matches the prompt. **Data type:** list
`HumanRatings["SDXL_Turbo"]` | **Description:** human ratings of how well the SDXL_Turbo image matches the prompt. **Data type:** list
`HumanRatings["Midjourney_6"]` | **Description:** human ratings of how well the Midjourney_6 image matches the prompt. **Data type:** list
`HumanRatings["DeepFloyd_I_XL_v1"]` | **Description:** human ratings of how well the DeepFloyd_I_XL_v1 image matches the prompt. **Data type:** list
`HumanRatings["SDXL_2_1"]` | **Description:** human ratings of how well the SDXL_2_1 image matches the prompt. **Data type:** list
`HumanRatings["SDXL_Base"]` | **Description:** human ratings of how well the SDXL_Base image matches the prompt. **Data type:** list
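Because `HumanRatings` maps each model name to a list of ratings (three per prompt-image pair, as noted under Statistics), a natural first step is to average them. The sketch below assumes a row has already been loaded as a plain Python dict with the fields above; the helper `mean_ratings` and the example row are hypothetical.

```python
from statistics import mean

# Model keys as listed in the Data Fields table.
MODELS = ["DALLE_3", "DeepFloyd_I_XL_v1", "Midjourney_6",
          "SDXL_2_1", "SDXL_Base", "SDXL_Turbo"]


def mean_ratings(row: dict) -> dict[str, float]:
    """Average the human ratings for each model's image in one row.

    Assumes row["HumanRatings"] maps model names to lists of 1-to-5 ratings.
    Models that are missing or have an empty rating list are skipped.
    """
    ratings = row["HumanRatings"]
    return {m: mean(ratings[m]) for m in MODELS if ratings.get(m)}


# Hypothetical row with made-up ratings (image fields omitted for brevity).
example_row = {
    "Index": "0",
    "Prompt": "a red cube on top of a blue sphere",
    "HumanRatings": {"DALLE_3": [5, 4, 5], "SDXL_Turbo": [2, 3, 2]},
}
print(mean_ratings(example_row))
```

Averaging the three ratings gives one alignment score per prompt-image pair, which is convenient for comparing models or correlating with automated metrics.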
### Statistics
Dataset | Number of Prompts | Number of Skill Tags | Number of Images | Number of Videos | Number of Human Ratings
--- | ---: | ---: | ---: | ---: | ---:
GenAI-Bench | 1,600 | 5,000+ | 9,600 | -- | 28,800
GenAI-Bench-Video | 800 | 2,500+ | -- | 3,200 | 9,600
GenAI-Bench-Ranking | 800 | 2,500+ | 14,400 | -- | 43,200

(Each prompt-image or prompt-video pair has three human ratings.)
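The counts in the table are internally consistent: each visual count is the number of prompts times the visuals generated per prompt, and each rating count is three times the number of visuals. The quick check below takes 6 image models for GenAI-Bench and 4 video models for GenAI-Bench-Video from the Data Source section; the 18 images per prompt for GenAI-Bench-Ranking is an assumption derived as 14,400 / 800, since that figure is not stated directly.

```python
# (prompts, visuals per prompt, reported visuals, reported ratings), from the table above.
STATS = {
    "GenAI-Bench": (1600, 6, 9600, 28800),
    "GenAI-Bench-Video": (800, 4, 3200, 9600),
    "GenAI-Bench-Ranking": (800, 18, 14400, 43200),  # 18 per prompt is derived
}
RATINGS_PER_PAIR = 3

for name, (prompts, per_prompt, visuals, ratings) in STATS.items():
    assert prompts * per_prompt == visuals, name
    assert visuals * RATINGS_PER_PAIR == ratings, name

total_ratings = sum(stats[3] for stats in STATS.values())
print(total_ratings)  # 81600
```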
## Data Source
### Prompts
All prompts are sourced from professional designers who use tools such as Midjourney and CIVITAI.
### Multiple Compositional Tags for Prompts
All tags on each prompt are verified by human annotators.
### Generated Images
Images were generated for all 1,600 GenAI-Bench prompts using DALLE_3, DeepFloyd_I_XL_v1, Midjourney_6, SDXL_2_1, SDXL_Base, and SDXL_Turbo.
### Generated Videos
Videos were generated for the 800 GenAI-Bench-Video prompts using Pika, Gen2, ModelScope, and Floor33.
### Human Ratings
We hired three trained human annotators to rate each generated image or video independently. Annotators were paid the local minimum wage of $12 per hour, for a total of about 800 annotator hours.
## Dataset Construction
### Overall Process

- **Prompt Collecting:** we source prompts from professional designers who use tools such as Midjourney and CIVITAI. This ensures the prompts encompass practical skills relevant to real-world applications and are free of subjective or inappropriate content.
- **Compositional Skills Tagging:** each GenAI-Bench prompt is carefully tagged with all of the skills it evaluates.
- **Image/Video Collecting and Human Rating:** we then generate images and videos using state-of-the-art models like SD-XL and Gen2. We follow the recommended annotation protocol to collect 1-to-5 Likert scale ratings for how well the generated visuals align with the input text prompts.
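To connect the pipeline above with the released field layout, the sketch below assembles one record from a prompt, its tags, a set of image generators, and three raters. Everything here is hypothetical: `build_row`, the `Generator`/`Rater` callables, and the dummy lambdas stand in for real models and human annotators and are not part of any released GenAI-Bench code.

```python
from typing import Callable

# Hypothetical stand-ins: a generator maps a prompt to an image (any object),
# and a rater returns one 1-to-5 Likert rating for a (prompt, image) pair.
Generator = Callable[[str], object]
Rater = Callable[[str, object], int]


def build_row(index: str, prompt: str, tags: dict,
              generators: dict[str, Generator], raters: list[Rater]) -> dict:
    """Assemble one record in the GenAI-Bench field layout (sketch only)."""
    row = {"Index": index, "Prompt": prompt, "Tags": tags, "HumanRatings": {}}
    for model_name, generate in generators.items():
        image = generate(prompt)
        row[model_name] = image
        # Each rater scores the prompt-image pair independently.
        ratings = [rate(prompt, image) for rate in raters]
        if any(not 1 <= r <= 5 for r in ratings):
            raise ValueError(f"rating outside the 1-5 Likert range for {model_name}")
        row["HumanRatings"][model_name] = ratings
    return row


# Dummy generator and three dummy raters, for illustration only.
row = build_row(
    index="0",
    prompt="two dogs but only one is wearing a hat",
    tags={"basic_skills": ["attribute"], "advanced_skills": ["counting", "differentiation"]},
    generators={"DALLE_3": lambda p: f"<image for: {p}>"},
    raters=[lambda p, img: 4, lambda p, img: 5, lambda p, img: 3],
)
print(row["HumanRatings"])  # {'DALLE_3': [4, 5, 3]}
```

Keeping ratings as a list per model mirrors the `HumanRatings` dict of lists described under Data Fields.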
# Leaderboard
<img src="https://huggingface.co/datasets/BaiqiL/GenAI-Bench-pictures/resolve/main/vqascore_leaderboard.jpg" alt="leaderboard" width="500"/>
## Licensing Information
Apache-2.0
## Maintenance
We will continuously update the GenAI-Bench benchmark. If you have any questions about the dataset or notice any issues, please feel free to contact [Baiqi Li](mailto:libaiqi123@gmail.com) or [Zhiqiu Lin](mailto:zhiqiul@andrew.cmu.edu). Our team is committed to maintaining this dataset in the long run to ensure its quality!
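Provider:
maas
Created:
2025-04-11
Curated summary
Dataset introduction

Background and challenges
Background overview
GenAI-Bench-1600 is a dataset of 1,600 text prompts from professional designers, used to evaluate compositional text-to-visual generation. It contains images generated by several leading models along with human ratings, and covers both basic and advanced compositional skills.
The above content was collected and summarized by 遇见数据集.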



