
zhangshengchoa/genarena

Source: Hugging Face · Updated 2026-04-01 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/zhangshengchoa/genarena
Description:
---
license: apache-2.0
task_categories:
- image-text-to-image
- image-to-image
- text-to-image
language:
- en
size_categories:
- 1K<n<10K
---

# GenArena

A unified evaluation framework for visual generation tasks using VLM-based pairwise comparison and Elo ranking.

[![arXiv](https://img.shields.io/badge/arXiv-2602.XXXXX-b31b1b.svg)](https://arxiv.org/abs/2602.XXXXX)
[![Project Page](https://img.shields.io/badge/Project-Website-orange)](https://genarena.github.io)
[![Leaderboard](https://img.shields.io/badge/%F0%9F%8F%86%20Leaderboard-Live-brightgreen)](https://huggingface.co/spaces/genarena/leaderboard)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-GenArena-yellow)](https://huggingface.co/datasets/rhli/genarena)

## Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard, across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce **GenArena**, a unified evaluation framework that leverages a *pairwise comparison* paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.

## Quick Start

### Installation

```bash
pip install genarena
```

Or install from source:

```bash
git clone https://github.com/ruihanglix/genarena.git
cd genarena
pip install -e .
```

### Initialize Arena

Download benchmark data and official arena data with one command:

```bash
genarena init --arena_dir ./arena --data_dir ./data
```

This downloads:
- Benchmark Parquet data from `rhli/genarena` (HuggingFace)
- Official arena data (model outputs + battle logs) from `rhli/genarena-battlefield`

### Environment Setup

Set your VLM API credentials:

```bash
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://api.example.com/v1"
```

For multi-endpoint support (load balancing and failover), use comma-separated values:

```bash
export OPENAI_BASE_URLS="https://api1.example.com/v1,https://api2.example.com/v1"
export OPENAI_API_KEYS="key1,key2,key3"
```

### Run Evaluation

```bash
genarena run --arena_dir ./arena --data_dir ./data
```

### View Leaderboard

```bash
genarena leaderboard --arena_dir ./arena --subset basic
```

### Check Status

```bash
genarena status --arena_dir ./arena --data_dir ./data
```

## Running Your Own Experiments

### Directory Structure

To add your own model for evaluation, organize outputs in the following structure:

```
arena_dir/
└── <subset>/
    └── models/
        └── <GithubID>_<modelName>_<yyyymmdd>/
            └── <model_name>/
                ├── 000000.png
                ├── 000001.png
                └── ...
```

For example:

```
arena/basic/models/johndoe_MyNewModel_20260205/MyNewModel/
```

### Generate Images with Diffgentor

Use [Diffgentor](https://github.com/ruihanglix/diffgentor) to batch generate images for evaluation:

```bash
# Download benchmark data
hf download rhli/genarena --repo-type dataset --local-dir ./data

# Generate images with your model
diffgentor edit --backend diffusers \
    --model_name YourModel \
    --input ./data/basic/ \
    --output_dir ./arena/basic/models/yourname_YourModel_20260205/YourModel/
```

### Run Battles for New Models

```bash
genarena run --arena_dir ./arena --data_dir ./data \
    --subset basic \
    --exp_name yourname_YourModel_20260205
```

GenArena automatically detects new models and schedules battles against existing models.

## Submit to Official Leaderboard

> **Coming Soon**: The `genarena submit` command will allow you to submit your evaluation results to the official GenArena leaderboard via GitHub PR.

The workflow will be:
1. Run evaluation locally with `genarena run`
2. Upload results to your HuggingFace repository
3. Submit via `genarena submit` which creates a PR for review

## Citation

```bibtex
TBD
```
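
As background on the "pairwise comparison and Elo ranking" idea named at the top of this README, the sketch below shows how a list of pairwise battle outcomes can be turned into a ranking with the textbook Elo update. It is an illustration only: the README does not state GenArena's K-factor, initial rating, or whether it uses sequential Elo or an order-independent fit, so `k=32.0`, `initial=1000.0`, the function names, and the `model_x`/`model_y`/`model_z` battles are all assumptions made for this example.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(battles, k=32.0, initial=1000.0):
    """Replay (model_a, model_b, outcome) battles in order and return ratings.

    outcome is 1.0 if model_a wins, 0.0 if model_b wins, and 0.5 for a tie.
    k and initial are illustrative defaults, not GenArena's actual settings.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in battles:
        exp_a = expected_score(ratings[model_a], ratings[model_b])
        delta = k * (outcome - exp_a)
        # Zero-sum update: whatever A gains, B loses.
        ratings[model_a] += delta
        ratings[model_b] -= delta
    return dict(ratings)


# Hypothetical battle log; in practice the outcomes would come from a VLM judge.
battles = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_z", 1.0),
    ("model_y", "model_z", 0.5),
]
ratings = elo_ratings(battles)
for name, score in sorted(ratings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.1f}")
```

Because the update is sequential, the result depends on battle order; a real leaderboard may shuffle and average, or fit all battles jointly, to reduce that sensitivity.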

Provider:
zhangshengchoa