zhangshengchoa/genarena
Source: Hugging Face. Updated 2026-04-01; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/zhangshengchoa/genarena
Resource description:
---
license: apache-2.0
task_categories:
- image-text-to-image
- image-to-image
- text-to-image
language:
- en
size_categories:
- 1K<n<10K
---
# GenArena
A unified evaluation framework for visual generation tasks using VLM-based pairwise comparison and Elo ranking.
[arXiv](https://arxiv.org/abs/2602.XXXXX)
[Project page](https://genarena.github.io)
[Leaderboard](https://huggingface.co/spaces/genarena/leaderboard)
[Dataset](https://huggingface.co/datasets/rhli/genarena)
## Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models (VLMs) as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited by stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce **GenArena**, a unified evaluation framework that leverages a *pairwise comparison* paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
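
To make the "pairwise comparison plus Elo ranking" idea concrete, here is a minimal sketch of the textbook Elo update applied to a list of pairwise outcomes. It is an illustration only: the rating procedure GenArena actually uses is not described in this card, and the names `expected_score`, `update_elo`, the K-factor of 32, the initial rating of 1000, and the model names are all assumptions.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    model_a: str,
    model_b: str,
    score_a: float,          # 1.0 = A wins, 0.5 = tie, 0.0 = B wins
    k: float = 32.0,         # assumed K-factor
    initial: float = 1000.0  # assumed starting rating
) -> None:
    """Apply one Elo update in place for a single pairwise battle."""
    ra = ratings.get(model_a, initial)
    rb = ratings.get(model_b, initial)
    ea = expected_score(ra, rb)
    ratings[model_a] = ra + k * (score_a - ea)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - ea))


# Hypothetical battle log: (model_a, model_b, score for model_a).
battles: List[Tuple[str, str, float]] = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]

ratings: Dict[str, float] = {}
for a, b, s in battles:
    update_elo(ratings, a, b, s)

# Sort models from highest to lowest rating.
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because each update adds to one model exactly what it subtracts from the other, the total rating mass stays constant, so only relative differences between models carry meaning.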
## Quick Start
### Installation
```bash
pip install genarena
```
Or install from source:
```bash
git clone https://github.com/ruihanglix/genarena.git
cd genarena
pip install -e .
```
### Initialize Arena
Download benchmark data and official arena data with one command:
```bash
genarena init --arena_dir ./arena --data_dir ./data
```
This downloads:
- Benchmark Parquet data from `rhli/genarena` (HuggingFace)
- Official arena data (model outputs + battle logs) from `rhli/genarena-battlefield`
### Environment Setup
Set your VLM API credentials:
```bash
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.example.com/v1"
```
For multi-endpoint support (load balancing and failover), use comma-separated values:
```bash
export OPENAI_BASE_URLS="https://api1.example.com/v1,https://api2.example.com/v1"
export OPENAI_API_KEYS="key1,key2,key3"
```
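
As a rough illustration of what comma-separated endpoint variables make possible, the sketch below splits such a value into a list and tries each endpoint in order until one succeeds. This is not GenArena's implementation: `parse_csv_env`, `call_with_failover`, and `fake_request` are hypothetical helpers, and how GenArena pairs keys with URLs or balances load is not specified in this card.

```python
import os
from typing import Callable, List, Mapping, Optional, TypeVar

T = TypeVar("T")


def parse_csv_env(name: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Split a comma-separated environment variable into trimmed, non-empty items."""
    source = os.environ if env is None else env
    raw = source.get(name, "")
    return [item.strip() for item in raw.split(",") if item.strip()]


def call_with_failover(endpoints: List[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for url in endpoints:
        try:
            return request(url)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error


# Demo with an in-memory environment and a fake request function (both assumed).
demo_env = {
    "OPENAI_BASE_URLS": "https://api1.example.com/v1, https://api2.example.com/v1"
}
urls = parse_csv_env("OPENAI_BASE_URLS", demo_env)


def fake_request(url: str) -> str:
    """Pretend the first endpoint is down so the second one is used."""
    if "api1" in url:
        raise ConnectionError("api1 unavailable")
    return f"ok from {url}"


result = call_with_failover(urls, fake_request)
print(result)
```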
### Run Evaluation
```bash
genarena run --arena_dir ./arena --data_dir ./data
```
### View Leaderboard
```bash
genarena leaderboard --arena_dir ./arena --subset basic
```
### Check Status
```bash
genarena status --arena_dir ./arena --data_dir ./data
```
## Running Your Own Experiments
### Directory Structure
To add your own model for evaluation, organize outputs in the following structure:
```
arena_dir/
└── <subset>/
    └── models/
        └── <GithubID>_<modelName>_<yyyymmdd>/
            └── <model_name>/
                ├── 000000.png
                ├── 000001.png
                └── ...
```
For example:
```
arena/basic/models/johndoe_MyNewModel_20260205/MyNewModel/
```
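
The layout above can be built and sanity-checked with a few lines of Python. This is a sketch under assumptions: the helper names `model_output_dir` and `missing_indices` are hypothetical, and it assumes that outputs are named with zero-padded six-digit indices starting at `000000`, as the example tree suggests. The expected number of samples is something you would supply yourself.

```python
import tempfile
from pathlib import Path
from typing import List


def model_output_dir(
    arena_dir: str, subset: str, github_id: str, model_name: str, date_str: str
) -> Path:
    """Compose <arena_dir>/<subset>/models/<GithubID>_<modelName>_<yyyymmdd>/<model_name>."""
    exp_name = f"{github_id}_{model_name}_{date_str}"
    return Path(arena_dir) / subset / "models" / exp_name / model_name


def missing_indices(output_dir: Path, expected_count: int) -> List[int]:
    """Return sample indices that have no matching NNNNNN.png file."""
    present = {p.stem for p in output_dir.glob("*.png")}
    return [i for i in range(expected_count) if f"{i:06d}" not in present]


# Demo in a temporary directory: index 1 is deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = model_output_dir(tmp, "basic", "johndoe", "MyNewModel", "20260205")
    out_dir.mkdir(parents=True)
    (out_dir / "000000.png").touch()
    (out_dir / "000002.png").touch()
    gaps = missing_indices(out_dir, expected_count=3)
    print("missing:", gaps)
```

Running a check like this before starting battles can catch incomplete generations early.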
### Generate Images with Diffgentor
Use [Diffgentor](https://github.com/ruihanglix/diffgentor) to batch generate images for evaluation:
```bash
# Download benchmark data
hf download rhli/genarena --repo-type dataset --local-dir ./data
# Generate images with your model
diffgentor edit --backend diffusers \
    --model_name YourModel \
    --input ./data/basic/ \
    --output_dir ./arena/basic/models/yourname_YourModel_20260205/YourModel/
```
### Run Battles for New Models
```bash
genarena run --arena_dir ./arena --data_dir ./data \
    --subset basic \
    --exp_name yourname_YourModel_20260205
```
GenArena automatically detects new models and schedules battles against existing models.
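
One simple way to picture this scheduling step is to pair every new model with every existing model (and with other new models), then skip pairs that already have battle logs. The sketch below does that. It is an assumption about the general idea, not GenArena's actual scheduler, and `schedule_battles` and the model names are hypothetical.

```python
from itertools import combinations, product
from typing import FrozenSet, Iterable, List, Tuple

Pair = Tuple[str, str]


def schedule_battles(
    existing: Iterable[str],
    new: Iterable[str],
    played: FrozenSet[Pair] = frozenset(),
) -> List[Pair]:
    """List unplayed pairs that involve at least one new model.

    Each pair is stored in sorted order so (A, B) and (B, A) count as one battle.
    """
    existing, new = list(existing), list(new)
    pairs = set()
    # New model versus every existing model.
    for a, b in product(new, existing):
        pairs.add(tuple(sorted((a, b))))
    # New models versus each other.
    for a, b in combinations(new, 2):
        pairs.add(tuple(sorted((a, b))))
    return sorted(p for p in pairs if p not in played)


todo = schedule_battles(
    existing=["ModelA", "ModelB"],
    new=["YourModel"],
    played=frozenset({("ModelA", "YourModel")}),
)
print(todo)
```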
## Submit to Official Leaderboard
> **Coming Soon**: The `genarena submit` command will allow you to submit your evaluation results to the official GenArena leaderboard via GitHub PR.
The workflow will be:
1. Run evaluation locally with `genarena run`
2. Upload results to your HuggingFace repository
3. Submit via `genarena submit` which creates a PR for review
## Citation
```bibtex
TBD
```
Provider: zhangshengchoa



