
GenAI-Bench

Source: ModelScope community. Updated 2025-12-05. Indexed 2025-02-08.
Download link:
https://modelscope.cn/datasets/TIGER-Lab/GenAI-Bench
Description:
# GenAI-Bench

[Paper](https://arxiv.org/abs/2406.04485) | [🤗 GenAI Arena](https://huggingface.co/spaces/TIGER-Lab/GenAI-Arena) | [Github](https://github.com/TIGER-AI-Lab/GenAI-Bench)

## Introduction

GenAI-Bench is a benchmark that measures how well MLLMs judge the quality of AI-generated content, by comparing their judgments with human preferences collected through our [🤗 GenAI-Arena](https://huggingface.co/spaces/TIGER-Lab/GenAI-Arena). In other words, we evaluate the capabilities of existing MLLMs as multimodal reward models; in this view, GenAI-Bench is a reward-bench for multimodal generative models.

We filter the existing votes with an NSFW filter and other heuristics, which leaves 1735 votes for image generation, 919 votes for image editing, and 1069 votes for video generation. These votes are used to evaluate how well MLLMs align with human preferences.

We adopt a pairwise comparison template for each task, where the model is asked to output one of 4 labels for each pair of AI-generated contents: `A>B`, `B>A`, `A=B=Good`, or `A=B=Bad`. We then calculate the average accuracy of the model by comparing its predictions with the human preferences.

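The accuracy computation described above can be sketched as follows. This is a minimal illustration, not the repository's evaluation code: the function name `pairwise_accuracy`, the example votes, and the choice to count any output outside the four labels as wrong are all assumptions made for this sketch.

```python
# The four labels named in the benchmark description.
LABELS = ("A>B", "B>A", "A=B=Good", "A=B=Bad")


def pairwise_accuracy(predictions, human_votes):
    """Return the percentage of pairs where the prediction equals the human vote.

    A prediction outside LABELS (for example, unparseable model output) is
    counted as wrong. That handling is an assumption of this sketch.
    """
    if len(predictions) != len(human_votes):
        raise ValueError("predictions and human_votes must have the same length")
    if not human_votes:
        raise ValueError("at least one vote is required")
    correct = sum(
        1
        for pred, vote in zip(predictions, human_votes)
        if pred in LABELS and pred == vote
    )
    return 100.0 * correct / len(human_votes)


# Made-up votes for illustration only; these are not benchmark data.
human = ["A>B", "B>A", "A=B=Good", "A=B=Bad", "A>B"]
model = ["A>B", "A>B", "A=B=Good", "unparseable", "A>B"]
accuracy = pairwise_accuracy(model, human)
print(f"accuracy = {accuracy:.2f}")
```

With four possible labels, a judge that guesses uniformly at random would be expected to score about 25%, which is in line with the `random` baseline row of the leaderboard.
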
The prompt templates are shown below:

- [Image Generation](https://github.com/TIGER-AI-Lab/GenAI-Bench/blob/main/genaibench/templates/image_generation/pairwise.txt)
- [Image Editing](https://github.com/TIGER-AI-Lab/GenAI-Bench/blob/main/genaibench/templates/image_edition/pairwise.txt)
- [Video Generation](https://github.com/TIGER-AI-Lab/GenAI-Bench/blob/main/genaibench/templates/video_generation/pairwise.txt)

## Evaluate a new model

Please refer to our Github README: [#evaluate-a-model](https://github.com/TIGER-AI-Lab/GenAI-Bench?tab=readme-ov-file#evaluate-a-model)

## Contribute a new model

Please refer to our Github README: [#contributing-a-new-model](https://github.com/TIGER-AI-Lab/GenAI-Bench?tab=readme-ov-file#contributing-a-new-model)

## Current Leaderboard (on `test_v1` split)

(Updated on 2024-08-09)

| Model | Template | Image Generation | Image Editing | Video Generation | Average |
| :---------------------: | :------: | :--------------: | :-----------: | :--------------: | :-----: |
| random | pairwise | 25.36 | 25.9 | 25.16 | 25.47 |
| gpt4o | pairwise | 45.59 | 53.54 | 48.46 | 49.2 |
| gemini-1.5-pro | pairwise | 44.67 | 55.93 | 46.21 | 48.94 |
| llava | pairwise | 37.0 | 26.12 | 30.4 | 31.17 |
| idefics2 | pairwise | 42.25 | 27.31 | 16.46 | 28.67 |
| llavanext | pairwise | 22.65 | 25.35 | 21.7 | 23.23 |
| minicpm-V-2.5 | pairwise | 37.81 | 25.24 | 6.55 | 23.2 |
| blip2 | pairwise | 26.34 | 26.01 | 16.93 | 23.09 |
| videollava | pairwise | 37.75 | 26.66 | 0.0 | 21.47 |
| cogvlm | pairwise | 29.34 | 0.0 | 24.6 | 17.98 |
| qwenVL | pairwise | 26.63 | 14.91 | 2.15 | 14.56 |
| instructblip | pairwise | 3.11 | 19.8 | 3.74 | 8.88 |
| idefics1 | pairwise | 0.81 | 5.66 | 0.19 | 2.22 |
| ottervideo | pairwise | 0.0 | 0.0 | 0.0 | 0.0 |
| otterimage | pairwise | 0.0 | 0.0 | 0.0 | 0.0 |
| kosmos2 | pairwise | 0.0 | 0.0 | 0.0 | 0.0 |

All scores are accuracies in percent.

## Citation

```bibtex
@article{jiang2024genai,
  title={GenAI Arena: An Open Evaluation Platform for Generative Models},
  author={Jiang, Dongfu and Ku, Max and Li, Tianle and Ni, Yuansheng and Sun, Shizhuo and Fan, Rongqi and Chen, Wenhu},
  journal={arXiv preprint arXiv:2406.04485},
  year={2024}
}
```
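
A note on reading the leaderboard: the Average column appears to be the unweighted mean of the three per-task accuracies, rounded to two decimals. The source does not state this; it is an inference from the reported numbers, and the helper name `macro_average` is hypothetical. The sketch below recomputes the average for four rows copied from the table.

```python
# Rows copied from the leaderboard:
# (image generation, image editing, video generation, reported average).
rows = {
    "random": (25.36, 25.9, 25.16, 25.47),
    "gpt4o": (45.59, 53.54, 48.46, 49.2),
    "gemini-1.5-pro": (44.67, 55.93, 46.21, 48.94),
    "llava": (37.0, 26.12, 30.4, 31.17),
}


def macro_average(scores):
    """Unweighted mean of per-task accuracies, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# Recompute the average from the three task columns and show it next to
# the value reported in the table.
recomputed = {name: macro_average(vals[:3]) for name, vals in rows.items()}
for name, vals in rows.items():
    print(f"{name}: recomputed={recomputed[name]} reported={vals[3]}")
```
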

Provider:
maas
Created:
2025-02-04
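Background and challenges
Background overview
GenAI-Bench is a benchmark dataset for evaluating how well multimodal large language models (MLLMs) judge the quality of AI-generated content. It uses human preferences collected on image generation, image editing, and video generation tasks to evaluate models as multimodal reward models. The dataset adopts a pairwise comparison template that asks the model to output one of four labels, from which the accuracy of its alignment with human preferences is computed, and it provides a leaderboard of current model performance.
The summary above was collected and generated by 遇见数据集.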