SKYLENAGE-GameCodeGym
Source: ModelScope community (魔搭社区) · Updated 2026-05-09 · Indexed 2025-09-27
Download link:
https://modelscope.cn/datasets/Alibaba-DT/SKYLENAGE-GameCodeGym
Resource description:
[SKYLENAGE Platform](https://skylenage.alibaba-inc.com/sla/home)
[Project page](https://v-gamegym.github.io/index.html)
[Leaderboard](https://v-gamegym.github.io/leaderboard.html)
[Paper (arXiv:2509.20136)](https://arxiv.org/abs/2509.20136)
[GitHub repository](https://github.com/alibaba/SKYLENAGE-GameCodeGym/)
# I. Benchmark Introduction
**SKYLENAGE-GameCodeGym (V-GameGym)** is a comprehensive benchmark for code LLMs that addresses the lack of evaluation for visual game development.
It includes **2,219 samples** across **100 clusters**, curated with a clustering-based method to ensure diversity and completeness.
# II. Benchmark Features
1. **Game-specific metrics**: Playability, aesthetics, and user engagement.
2. **Multimodal evaluation**: LLM-driven visual code synthesis in a UI sandbox.
3. **Validated effectiveness**: Narrows the gap between code-generation accuracy and real development workflows.
4. **Quantifiable results**: Provides measurable indicators for visual programming.
# III. Leaderboard
| Rank | Model Name | Company | Total | Code | Screenshot | Video | Release Date |
|------|-----------------------------------|-------------|-------|-------|------------|-------|--------------|
| 🥇 1 | GPT-5-20250807 | OpenAI | 45.0 | 96.6 | 17.6 | 20.7 | 2025-08-07 |
| 🥈 2 | GPT-o3 | OpenAI | 44.8 | 92.3 | 20.2 | 21.9 | 2025-04-16 |
| 🥉 3 | Gemini-2.5-pro | Google | 43.5 | 89.1 | 19.1 | 22.2 | 2025-06-17 |
| 4 | GPT-5-mini | OpenAI | 43.5 | 96.7 | 15.7 | 18.0 | 2025-08-07 |
| 5 | GPT-oss-120b | OpenAI | 43.4 | 90.1 | 19.7 | 20.3 | 2025-08-21 |
| 6 | GPT-o4-mini (high) | OpenAI | 43.0 | 87.8 | 19.8 | 21.4 | 2025-04-16 |
| 7 | Qwen3-235B-A22B-2507 (Thinking) | Alibaba | 42.3 | 84.5 | 20.0 | 22.4 | 2025-07-25 |
| 8 | Grok-4-0709 | xAI | 42.0 | 83.9 | 19.8 | 22.4 | 2025-07-09 |
| 9 | Gemini-2.5-flash | Google | 42.0 | 92.8 | 16.5 | 16.7 | 2025-06-17 |
| 10 | Qwen3-Coder-480B-A35B-Instruct | Alibaba | 41.4 | 85.3 | 18.3 | 20.5 | 2025-07-23 |
| 11 | DeepSeek-V3-0324 | DeepSeek | 41.2 | 83.7 | 19.3 | 20.5 | 2025-03-24 |
| 12 | Qwen3-235B-A22B-Instruct-2507 | Alibaba | 41.1 | 85.3 | 18.2 | 19.7 | 2025-07-21 |
| 13 | DeepSeek-V3.1 | DeepSeek | 40.9 | 83.1 | 19.3 | 20.2 | 2025-08-21 |
| 14 | Claude-Sonnet-4-20250514-Thinking | Anthropic | 40.5 | 90.3 | 14.4 | 16.9 | 2025-05-14 |
| 15 | Seed-OSS-36B-Instruct | ByteDance | 40.3 | 88.3 | 16.4 | 16.2 | 2025-08-21 |
| 16 | GLM-4.5 | Zhipu AI | 40.0 | 84.7 | 17.0 | 18.3 | 2025-07-28 |
---
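The card does not state how the Total column is aggregated. Judging only from the numbers in the leaderboard, each Total appears to equal the unweighted mean of the Code, Screenshot, and Video scores, rounded to one decimal place. The sketch below checks that assumption against all 16 rows. The function names `mean_score` and `find_mismatches` and the tolerance of 0.05 are illustrative choices, not part of the benchmark's tooling.

```python
# Leaderboard rows copied from the table in section III:
# (rank, total, code, screenshot, video)
ROWS = [
    (1, 45.0, 96.6, 17.6, 20.7),
    (2, 44.8, 92.3, 20.2, 21.9),
    (3, 43.5, 89.1, 19.1, 22.2),
    (4, 43.5, 96.7, 15.7, 18.0),
    (5, 43.4, 90.1, 19.7, 20.3),
    (6, 43.0, 87.8, 19.8, 21.4),
    (7, 42.3, 84.5, 20.0, 22.4),
    (8, 42.0, 83.9, 19.8, 22.4),
    (9, 42.0, 92.8, 16.5, 16.7),
    (10, 41.4, 85.3, 18.3, 20.5),
    (11, 41.2, 83.7, 19.3, 20.5),
    (12, 41.1, 85.3, 18.2, 19.7),
    (13, 40.9, 83.1, 19.3, 20.2),
    (14, 40.5, 90.3, 14.4, 16.9),
    (15, 40.3, 88.3, 16.4, 16.2),
    (16, 40.0, 84.7, 17.0, 18.3),
]


def mean_score(code, screenshot, video):
    """Unweighted mean of the three component scores.

    This aggregation is an assumption inferred from the table,
    not a formula stated by the benchmark authors.
    """
    return (code + screenshot + video) / 3


def find_mismatches(rows, tolerance=0.05):
    """Return the ranks whose listed Total differs from the mean
    of the three components by more than `tolerance`."""
    return [
        rank
        for rank, total, code, screenshot, video in rows
        if abs(mean_score(code, screenshot, video) - total) > tolerance
    ]


if __name__ == "__main__":
    # An empty list means every row matches the assumed aggregation.
    print(find_mismatches(ROWS))  # prints []
```

If the assumption holds, it also explains why the Total scores cluster so tightly: Code scores are in the 80s and 90s while Screenshot and Video scores stay below 23, so the visual components account for most of the headroom.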
# IV. Contact Us
For more details, please visit the **SKYLENAGE Platform**:
https://skylenage.alibaba-inc.com/sla/home
Contact us: **skylenage@service.alibaba.com**
Provider: maas
Created: 2025-09-23



